GENERAL REVIEW OF MILITARY APPLICATIONS OF VOICE PROCESSING


DR. BRUNO BEEK
MR.  RICHARD  S.  VONUSA 


Rome  Air  Development  Center 


Griffiss  Air  Force  Base  NY  13441 


SUMMARY 


This paper introduces the subject of Voice-Interactive Systems and their role in
military applications. The history and evolution of automatic speech recognition and
synthesis are briefly explored and the current state-of-the-art is reviewed. The term
"Voice-Interactive Systems" is defined and the advantages and disadvantages of
Voice-Interactive Systems are highlighted. Next, previous applications of speech
systems to military problems are summarized, the major application areas are described
and current development projects in the US and other NATO countries are presented.
Special attention is focused on the cockpit application. Several projects in this area
are discussed along with a summary of important issues to consider when applying
Voice-Interactive Systems to the aircraft environment.

1. Introduction

The objective of this paper is to provide a broad perspective on the topic of
Voice-Interactive Systems, and in particular the use of these systems for cockpit and
other  military  applications.  Before  the  actual  application  of  these  systems  is 
discussed,  it  would  be  helpful  to  consider  such  questions  as:  “How  did  speech  systems 
evolve?",  "What  techniques  are  used  to  do  speech  recognition  and  synthesis?",  and  "What 
are  the  major  problems  in  doing  recognition  and  synthesis?"  By  answering  these 
questions  it  is  hoped  the  reader  can  gain  an  understanding  of  the  current  capabilities 
and limitations of automated speech systems. With this background, the issues of
applying speech systems to military problems are addressed. Because the use of speech
systems  is  now  so  widespread  the  treatment  here  is  not  exhaustive,  but  instead  is 
intended  to  be  representative  of  much  of  the  current  research.  This  paper  is  a  general 
survey  and  cannot  comprehensively  cover  all  of  the  topics  presented.  The  extensive 
bibliography  provides  references  for  more  detailed  information  for  the  interested 
reader. 

2. What are Voice-Interactive Systems?

Potential  users  in  industry,  the  home  and  the  military  are  starting  to  become 
excited  about  the  possibilities  of  voice  interaction  with  machines.  Speech  technology 
has  recently  received  considerable  publicity  as  new  applications  are  discovered 
(Doddington, G. R., and Schalk, T. B., 1981; Simmons, E. J., 1979; Levinson, S. E. and
Liberman,  M.  Y.,  1981).  In  this  section  the  history  of  speech  technology  is  traced 
from  the  pioneering  days  in  the  fifties  to  the  present,  and  the  advantages  and 
disadvantages  of  speech  are  discussed.  Prior  to  this  discussion,  it  would  be  useful  to 
review  some  terminology.  Figure  1  lists  some  of  the  terms  presently  used  in  voice 
processing technology. Voice-interactive systems include both Automatic Speech
Recognition (ASR) and systems for speech synthesis. "Speech Recognition" refers to
machine recognition of a human speaker who utters single words or short sequences of
words. "Speech Synthesis", sometimes called voice response or voice output, refers to
a machine which can generate a human or human-like vocal response. "Speech
Understanding" is sometimes used to refer to machines which can not only recognize
complete sentences (as opposed to a single word or short sequence of words) but
somehow interpret the meaning of the sentence as well. "Speaker identification" (or
speaker recognition) is used to
describe  a  machine  which  has  the  ability  to  determine  who  is  speaking,  rather  than  what 
is  said.  There  are  other  useful  actions  based  on  speech  that  can  be  performed  by 
machines  including:  the  capability  to  recognize  what  language  is  being  spoken 
(language  identification),  the  ability  to  remove  noise  or  interference  from  speech 
(speech  enhancement),  the  capability  to  compress  the  bandwidth  necessary  to  transmit 
digital  speech  (vocoders),  the  capability  to  detect  abnormal  physical  or  psychological 
conditions  of  the  speaker  (stress  analysis),  and  others.  The  useful  functions  being 
performed  by  these  systems  are  accomplished  through  the  theory  and  practice  of  speech 
processing  technology.  How  this  technology  has  evolved  is  the  topic  of  the  next 
section.

2.1 The Evolution of Speech Processing Technology

Speech processing technology has been based, to a large degree, on early research
in such areas as experimental phonetics, the physiology of the human vocal apparatus
and auditory system, human perception of speech, and especially acoustics and
phonetics. This basic research provided much of the fundamental knowledge which is
required to some extent in almost all speech processing systems. One of the first
applications of this knowledge to speech processing was reported in 1952 (Davis, K. H.,
et al, 1952) with the demonstration of a successful speech recognition system which
could recognize the digits spoken by a single talker. One of the first electronic speech
synthesis systems was produced even earlier, in 1939, by researchers at Bell Labs
(Dudley, H.; Riesz, R. R.; Watkins, S. A.; 1939). This system was called the Voder and
was also a forerunner of the modern vocoder.

Important  milestones  in  the  development  of  speech  processing  technology  occurred  in 
1956 and 1959, with the first efforts to incorporate linguistic information (Wiren, J.
and Stubbs, H. L.; 1956) and the use of a general-purpose digital computer (Forgie, J. W.
and Forgie, C. D.; 1959) respectively. One of the first speech recognition systems which
could  recognize  continuous  speech  was  developed  in  1969  and  could  accommodate  a  highly 
constrained  vocabulary  of  16  words  (Vicens,  P.  J.;  1969).  Much  of  this  early  work 
assumed  all  of  the  information  required  to  do  recognition  could  be  extracted  from  the 
spectral  envelope  of  the  acoustic  speech  wave.  This  resulted  in  the  development  of 
many  approaches  for  spectral  analysis  of  speech,  and  with  these  analysis  approaches 
came very sophisticated mathematical techniques for manipulating the acoustic speech
parameters. Some of these mathematical techniques are: linear predictive coding
(LPC), dynamic programming, the Fast Fourier Transform (FFT), and homomorphic
filtering, among others. Speech recognition techniques which are concerned solely with
the manipulation of the acoustic waveform are sometimes referred to as mathematical,
pattern matching or statistical approaches. However, an alternate approach was soon to
receive  considerable  attention. 
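
As an illustration of how dynamic programming is used in the pattern matching approach, the following is a minimal sketch of time alignment between a stored word template and an incoming utterance. It is not the method of any system cited here. The feature vectors, the Euclidean local distance and the names `dtw_distance` and `recognize` are assumptions made for the example.

```python
import math

def dtw_distance(template, utterance):
    """Accumulated distance along the best time alignment of two feature sequences."""
    n, m = len(template), len(utterance)
    # cost[i][j] = best accumulated distance aligning template[:i] with utterance[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(template[i - 1], utterance[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(utterance, templates):
    """Return the vocabulary word whose stored template is closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(templates[word], utterance))

# Hypothetical two-dimensional feature vectors, for illustration only
templates = {
    "one": [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
    "two": [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)],
}
spoken = [(1.0, 0.1), (1.1, 0.0), (2.0, 0.1), (3.0, 0.0)]  # a slower "one"
print(recognize(spoken, templates))
```

The alignment lets an utterance spoken more slowly or quickly than the template still match it, which is one reason the technique suits discrete-utterance, speaker dependent recognition.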

Prior to 1970 most of the work in ASR was concerned with recognizers which could
only deal with a very limited vocabulary (typically 50 words or less), spoken in a
discrete manner, for a talker who had previously used the device. In fact, many of
these recognizers worked with high accuracy in the laboratory, and a group of
researchers at RCA were encouraged enough to leave and form their own ASR company,
Threshold Technology Inc., in 1970. However, the field of speech recognition was
sharply criticized in a letter written by John Pierce, a highly respected scientist at
Bell Laboratories (Pierce, J.; 1969). The letter accused researchers working in speech
recognition of failing to appreciate the difficulty of their task. Although the letter
seemed to put a temporary damper on the enthusiasm for speech recognition, the Advanced
Research Projects Agency (ARPA) nonetheless funded a large, 5 year effort in the field
in 1971. The ARPA project addressed the problem of speech understanding, rather than
speech recognition, and had a number of ambitious technical goals. There is
considerable debate even now as to the progress made in the project (Klatt,
D. H., 1977; and Neuberg, E. P., 1975). What is noteworthy is that the approach taken
in the project can be considered from the perspective of artificial intelligence (AI).
Unlike the mathematical approach, AI presumes that a perfect extraction of phonetic
features in speech is not necessary (or maybe even possible) because errors made in
this extraction phase can be compensated by knowledge obtained from so-called "higher
sources." This higher order knowledge includes syntax, semantics and the pragmatics of
discourse. It remains to be seen which approach will be more successful. Perhaps a
combination of approaches, along with greater computing power, will solve many
problems. Most agree that increasing the knowledge of the human speech process is
required before the effectiveness of speech systems matches the expectations of
potential users.

The  last  half  of  the  seventies  has  seen  increased  attention  given  to  the 
application of ASR technology to a variety of real-world problems, with less emphasis
being given to more fundamental research. There are still many of these fundamental
research problems which remain to be solved, as shall be discussed below. A number of
companies,  large  and  small,  now  have  speech  recognition  products  commercially 
available.  An  increasing  number  of  speech  synthesis  products  are  also  becoming 
available  in  the  marketplace. 

The state-of-the-art in speech recognition and synthesis will now be addressed.
Practical ASR is restricted to discrete-utterance, limited vocabulary, speaker
dependent recognition of high quality speech. The accuracy of such systems is
dependent on a variety of factors but accuracies near 100% in the laboratory and less
than 90% in field tests are typical. Synthesis systems are generally of three types:

1. Those which do a simple encoding of the speech signal. An example is a simple
digitization of real speech. The synthesis would then be accomplished by
digital-to-analog conversion.

2. Those which do a complex encoding of the speech signal. An example of this type
is linear predictive coding. Speech is encoded and then stored to form pre-recorded
messages which are synthesized by doing the inverse of the encoding process.

3. Synthesis-by-rule systems which require very little storage of actual speech,
but instead accept as input typed commands. The commands are interpreted and a
basic set of speech sounds (phonemes) are strung together and modified by a complex
sequence of rules.

There are three main trade-offs associated with speech synthesis systems. These are:
speech quality, memory requirements and message flexibility. The chart below
summarizes these trade-offs for the three types of synthesis:


SYNTHESIS TECHNIQUES      QUALITY     MEMORY      FLEXIBILITY

1. Simple Encoding        High        Greatest    Moderate
2. Complex Encoding       Moderate    Moderate    Low
3. Synthesis-By-Rule      Low         Low         High

Currently, there are more than 44 companies producing speech synthesis products
(Wong, 0., 1981). Excellent summaries of speech processing technology evolution can be
found among the references (Denes, P. B., 1975; Hyde, S. R., 1972; Reddy, D. R., 1976;
Lea, W. A., 1979).

2.2 Voice-Interactive System Defined

The  previous  discussion  highlighted  speech  technology,  not  voice-interactive 
systems. The term "voice-interactive system" emphasizes that it is the interface
between a human and a machine that is of interest. The "voice" part of a
voice-interactive system can mean either a human voice talking to a machine, or vice
versa. Since a human is involved, it is not only speech technology that is of concern,
but also the psychology and physiology of the man-machine interaction. Researchers
involved with speech processing are typically electrical engineers, computer scientists
or mathematicians. Those involved with voice-interactive systems have a more
behavioral science orientation, and include
experimental  psychologists  and  human  factors  engineers. 

What,  then,  is  meant  by  the  term  “voice-interactive  system"?  Conceivably  it  could 
mean  any  system  involving  humans  and  machines,  with  speech  as  the  mode  of 
communications.  Thus,  a  digital  voice  communications  system  could  qualify  as  a 
voice-interactive  system  under  this  definition.  However,  this  is  not  what  is  usually 
meant  by  the  term. 

A Voice-Interactive System is defined as the interface between a cooperative human
and a machine, which involves the recognition, understanding or synthesis of speech, to
accomplish a task of command, control or communications, and which involves feedback
from the listener to the speaker. With this definition, the digital communications
system no longer qualifies, because the system provides an interface between a human
and another human, not a machine. Likewise, speaker identification and language
identification systems do not qualify as Voice-Interactive Systems because they involve
non-cooperative speakers and no feedback between the speaker (human) and the listener
(machine). Figure 2 shows in a very simple way a voice-interactive system. The box
labeled "Speech I/O Subsystem" is some type of speech processing technology. Suppose
the voice-interactive system were one in which a human pilot could control certain
cockpit functions, and in addition could receive audio warning messages. The diagram of Fig. 2
can  then  be  drawn  more  specifically  as  shown  in  Fig.  3.  The  interaction  diagrammed  in 
Fig.  3  is  fairly  complex  and  is  intended  to  show  relationships  among  all  the  elements 
involved,  not  any  particular  system.  The  pilot  and  speech  I/O  sub-system  are  both 
listeners  and  speakers.  The  pilot  controls  certain  cockpit  functions  (which  have  not 
been  specified)  by  speaking  utterances  into  an  ASR  device.  The  controller  of  the 
voice-interactive  system  interprets  the  results  of  the  recognition  and  responds  with 
the  appropriate  controlling  actions  to  the  aircraft.  Suppose  an  emergency  situation 
arose  of  which  the  pilot  was  unaware.  Presumably,  the  aircraft  would  signal  this  to 
the  controller  which  would  respond  with  the  appropriate  synthesized  warning  message. 
The  pilot  would  then  take  corrective  action  which  may  require  him  to  use  manual 
controls,  and  the  warning  message  would  be  subsequently  halted.  In  this  case  there  is 
feedback  in  both  directions  between  man  and  machine. 
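
The two feedback paths described above can be sketched in code. The following is a minimal illustration only, assuming a hypothetical two-phrase vocabulary and a controller that exchanges plain text with separate recognition and synthesis devices. The names `COMMANDS` and `Controller` and the message wording are invented for the example and do not describe any particular system.

```python
from dataclasses import dataclass, field

# Hypothetical vocabulary: recognized utterance -> (cockpit function, setting)
COMMANDS = {
    "radio channel two": ("radio", "channel 2"),
    "radar standby": ("radar", "standby"),
}

@dataclass
class Controller:
    state: dict = field(default_factory=dict)         # cockpit functions set by voice
    active_warnings: set = field(default_factory=set)

    def on_recognized(self, utterance):
        """Pilot-to-machine path: act on a recognized utterance, return spoken feedback."""
        if utterance not in COMMANDS:
            return "say again"                # feedback: utterance not understood
        function, setting = COMMANDS[utterance]
        self.state[function] = setting
        return f"{function} {setting}"        # feedback: confirm the action taken

    def on_aircraft_signal(self, condition):
        """Machine-to-pilot path: the aircraft reports a condition to be announced."""
        self.active_warnings.add(condition)
        return f"warning {condition}"

    def on_condition_cleared(self, condition):
        """The pilot's corrective action removed the condition, so the warning halts."""
        self.active_warnings.discard(condition)

controller = Controller()
print(controller.on_recognized("radar standby"))
print(controller.on_aircraft_signal("engine fire"))
controller.on_condition_cleared("engine fire")
```

Each returned string stands for text that would be passed to the synthesis device, so that the machine acts as speaker and the pilot as listener.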

2.3  Advantages  of  Speech  Communications 

There  are  good  reasons  why  people  might  wish  to  use  speech  to  communicate  with 
machines  and  many  reports  have  detailed  the  relative  advantages  of  speech 
communications  (Lea,  W.  A.,  1968;  Lea,  W.  A.,  1979;  Martin,  T.  B.,  1976).  However, 
there  is  relatively  little  empirical  evidence  which  demonstrates  the  value  of  speech 
over other modes of communications, command or control. What empirical evidence does
exist seems encouraging. In a famous experiment run at Johns Hopkins University,
teams of people interacting together to solve problems solved them much faster using
voice than when using other modes of communication (Ochsman, R. B. and Chapanis,
A., 1974). Other studies indicating the advantage, in terms of speed and accuracy, of
voice over other modes of communications for certain tasks have been reported (Welch,
J. R., 1977; Harris, S., Owens, J., and North, R., 1979; Skriver, C., 1979; Wherry,
R., 1974; Poock, G. K., 1980). Despite the lack of supporting data, a list of
advantages  shall  be  presented  for  speech  communication  in  general,  and  in  a  later 
section  for  the  application  of  speech  in  the  cockpit  environment.  Many  of  these 
arguments  for  speech  are  of  the  "common  sense"  variety  and  there  are  undoubtedly  others 
that  could  be  added  to  them. 

The most powerful reason for using speech is the fact that it is man's most natural
form of communication and does not require special training to learn. A second strong
argument for voice is that it frees the hands and eyes for other tasks. Most of the
other  advantages  follow  directly  from  these  two.  Figure  4  shows  a  list  of  advantages 
of  speech  communication.  The  list  has  been  divided  into  three  sections:  engineering, 
psychological  and  physiological. 

2.4 Disadvantages of Speech Communications

The disadvantages of speech communications should be considered carefully. It is
important to make a distinction between the drawbacks of speech communications in
general and the limitations of current speech technology. The former is relevant in
speculating about the long-range possibilities of speech, and the latter is relevant to
near-term concerns. The general disadvantages of speech communications are shown in
Fig. 5, and those associated with cockpit environments are discussed in a later
section. The disadvantages of speech communications involve mainly the effects of a
hostile environment on the speech signal directly, or indirectly by a change of the
physical or emotional state of the speaker.

3.0 Issues of Cockpit Applications of Voice-Interactive Systems

The use of voice-interactive systems offers the potential for solving critical
man-machine problems in the aircraft cockpit. These problems are severe in military
aircraft, and especially in aircraft capable of high performance. In these aircraft,
crew members are often forced to cope with a very high workload, caused by inefficient
crew member stations, poor assignment of operator tasks, and an overwhelming number of
displays and indicators. In summary, the human operator is overwhelmed with too much
information and has too many visual/manual tasks to perform. There has been recent
attention focused on using voice-interactive systems in the cockpit to reduce the
operator workload problem and solve other man-machine problems. All three services in
the United States, many NATO countries and considerable industrial effort have addressed
the application of speech technology (Lane, N. E., and Harris, S. D., 1980; Coler,
[...], et al, 1980; Wicker, J. [...], 1980; Harris, S., et al, 1980;
[...], 1981). Because this is a tutorial paper, the discussion
[...] of the relevant issues and not a detailed technical [...].


3.1 Advantages and Disadvantages of Voice Interactive Systems in the Cockpit

[...] future high performance attack/fighter aircraft may exceed
[...] interface capabilities. Hence, a real-time voice interactive
[...] solution as a method of augmenting current control/display
[...]. [...] of this realization, a number of airframe manufacturers have
[...] and experimental design in interactive voice command/feedback
[...] fighter aircraft.

[...] experienced military pilots were questioned regarding the idea of
[...] fighter aircraft. In one study (Ruth, J. C. et al., [...]), [...]
were asked to rate, on a scale of zero to ten, the use of voice command
[...] in high performance military aircraft; zero, of course, represented
"[...] idea" and ten represented "sounds great." These pilots were opposed
to voice command for important decision making functions such as firing weapons,
[...] and control trim. However, they were in favor of mode selection
[...] such as radio channel selection, TACAN ILS, radar, bomb/NAV mode
[...], transponder setup. Some early results indicate that voice command
[...] be directly substituted for control/display interfaces in a fighter
aircraft.

In addition to the above, the military aircraft environment levies a number of
additional requirements on automatic recognition subsystems (voice recognition
devices). Some of these requirements are listed in Table 1.


TABLE 1

Oxygen Mask            High Ambient Background Noise
Microphone             Preselected Vocabulary
Physical Stress        Non-robust words
Emotional Stress       Human Error
Vibrational Effects    Overall reliability
Complexity             Cost, Size/Weight
Syntax

The requirements listed in Table 1 cause specific technical problems such as word
boundary  detection,  memory  requirements,  small  space,  noise  stripping  and  voice 
inconsistencies. 

Questions arise as to whether training should be done with pilots wearing oxygen
masks under actual flight conditions (different 0-levels, engine power levels, canopy
on-off). The types of signal input to the system must include the effects of regulation,
inhaling, exhaling, etc. Results have shown that because breath and background noise
cause drop offs at the ends of words, an end point detector based on energy level cannot
be used; hence more sophisticated automated end point detection is required.
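
To make the difficulty concrete, the following is a minimal sketch of the kind of simple energy-threshold end point detector that the results above rule out. The signal is synthetic, and the frame length, thresholds and function names are assumptions chosen for illustration; no measured cockpit data is used.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each consecutive, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def energy_endpoints(samples, frame_len, threshold):
    """Return (first_frame, last_frame) whose energy exceeds threshold, or None."""
    energies = frame_energies(samples, frame_len)
    active = [k for k, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0], active[-1]

# Synthetic signal: silence, a loud word body, a weak word ending, then breath noise
silence = [0.0] * 40              # frames 0-1
word_body = [0.8, -0.8] * 20      # frames 2-3, energy about 0.64
weak_ending = [0.1, -0.1] * 10    # frame 4, energy about 0.01, still part of the word
breath = [0.3, -0.3] * 20         # frames 5-6, energy about 0.09, not speech
signal = silence + word_body + weak_ending + breath

# A low threshold absorbs the breath noise into the word
print(energy_endpoints(signal, 20, threshold=0.05))   # (2, 6)
# A high threshold cuts off the weak word ending
print(energy_endpoints(signal, 20, threshold=0.2))    # (2, 3)
```

In this example the word occupies frames 2 through 4, and neither threshold recovers that span, because the breath noise is louder than the end of the word.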

3.2 Speech Synthesis in the Cockpit

Military aircraft applications of speech synthesis systems have also been
investigated, especially for caution and warning messages. Some of these applications
are listed in Table 2.


Table  2 


Applications  of  Speech  Synthesis  Systems 

Voice  Warning 
Time-to-Go  Countdown 
Fault  List  Feedback 
Way  Point  Announcement 

Voice  Response  for  Specific  Request  System  Data 
Audio  Feedback  to  Voice  Commands 

A few problem areas resulting from speech synthesis in a high performance aircraft are
specific message selection and corresponding voice quality. The design of these
synthesis systems must take into account pilot safety, sound level variation for
different noise levels, potential interference with other audio communications, and
cognitive and attentional demands.

It  can  be  said  there  is  general  agreement  that  voice  command,  control  and  synthesis 
systems  can  provide  the  military  aircrews  with  a  useful  adjunct  to  conventional  control 
and  display  interfaces  and  provide  warning  and  status  data  via  speech  synthesis. 
However,  in  order  to  apply  this  technology,  many  behavioral  and  human  factors  problems 
must  be  solved  as  well  as  some  very  difficult  technical  speech  recognition  issues. 
Clearly,  the  question  of  complexity  and  overall  reliability  of  a  voice  interactive 
system  in  an  aircraft  environment  must  be  addressed. 

4. Other Military Applications of Voice Processing Systems

(Beek, B., et al., 1977, 1978, 1982)

1.  Digital  Narrowband  Communications  Systems. 

Air  Force  tactical  communications  are  being  required  to  operate  in  increasingly 
difficult  and  hostile  situations.  Requirements  are  being  levied  on  spread  spectrum 
communications  systems  to  provide  increased  communications  capacity,  multiple  access, 
and  tactical  conferencing.  Higher  degrees  of  jam  resistance  and  a  lower  probability  of 
intercept  are  required  in  the  already  overcrowded,  dynamic  channel,  and  rapidly 
changing  signal  and  interference  environment.  (See  Fig.  6) 

These increased requirements stress existing frequency-hopped, pseudo-noise
(FH/PN) spread spectrum voice communications systems and ordinary HF and VHF radio
systems. Systems being considered in exploratory development address these demands
aggressively with a combination of Speech Processing, Adaptive Speech Processing,
Adaptive Signal Processing, and VHSIC type microcircuitry.

Presently, the standard method of voice digitization being used is 16 kilobit/sec
CVSD. For advanced applications, 2400 bit/sec LPC based systems are completing the
developmental process. The 2400 bit/sec LPC system provides a factor of 6.66 reduction
in input data rates, which of itself would allow that many more channels in a
communications bandwidth or an 8 db increase in processing gain.

Recent  research  has  produced  sufficiently  intelligible  demonstrations  of  advanced 
exploratory  techniques  which  are  capable  of  digitizing  continuous  speech  and  retaining 
a degree of speaker recognition at rates down to 400 bits/sec. An additional factor of
6 over "standard" LPC is obtained, providing that many more channels or another
8 db increase in processing gain. Total gain that can be achieved here is 36
additional channels or a 16 db increase in processing gain for the potential anti-jam
(AJ) systems.

Isolated  word  recognition  can  further  reduce  the  transmission  rate  to  about  80 
bits/sec  for  the  limited  vocabulary  case.  Basic  research  is  presently  underway 
applying  artificial  intelligence  methods  to  achieve  continuous  speech  recognition  with 
less limited vocabularies. Fig. 7 shows a simplified listing of the state-of-the-art of
voice digitization systems and their limitations. Another factor of 5 is obtained by
isolated word recognition, with
a  corresponding  7  db  increase  in  processing  gain.  Total  gain  that  can  be  achieved  is 
180  additional  channels  or  a  processing  gain  advantage  of  22  db. 
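
The channel and processing gain figures quoted in this section can be checked with a short calculation. The sketch below assumes that the processing gain from a rate reduction is taken as 10 log10 of the reduction factor for a fixed transmission bandwidth; this assumption reproduces the figures in the text to within rounding, but the text does not state the formula. The stage labels and the function name are invented for the example.

```python
import math

def processing_gain_db(rate_ratio):
    """Gain in dB from reducing the data rate by rate_ratio in a fixed bandwidth."""
    return 10 * math.log10(rate_ratio)

# Data rates quoted in the text, in bits per second
stages = [
    ("CVSD", 16000),
    ("LPC", 2400),
    ("advanced coding", 400),
    ("word recognition", 80),
]

# Reduction factor and gain for each successive step
for (prev_name, prev_rate), (name, rate) in zip(stages, stages[1:]):
    ratio = prev_rate / rate
    print(f"{prev_name} -> {name}: factor {ratio:.2f}, {processing_gain_db(ratio):.1f} dB")

# The text combines the factors rounded to 6, 6 and 5
rounded = 6 * 6 * 5
print(f"combined, rounded factors: {rounded}x, {processing_gain_db(rounded):.1f} dB")

# Using the quoted rates directly gives a slightly larger overall factor
exact = 16000 / 80
print(f"combined, quoted rates: {exact:.0f}x, {processing_gain_db(exact):.1f} dB")
```

The small differences between these values and the 8, 16 and 22 db figures in the text come from rounding at each step.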

Unfortunately, it becomes correspondingly more difficult to realize these gains in
practice. For example, the total delay incurred in going through the speech
processor/synthesizer combination grows as the degree of sophistication of the
processor increases. For the intermediate state of voice compression the challenge is
to achieve delays below 100 msec. For word recognition systems, delays of up to 250
msec are not acceptable. Additionally, the longer integration times required for the
longer bit times and the higher anti-jam requirements in many cases exceed the
coherency of the channel. Electromagnetic compatibility demands consideration of
shorter transmissions, implying the use of lower duty cycle transmissions. The
resulting loss in transmitted energy per bit requires an increase in the peak power of
the transmitted signal, which is not desirable for low detectability considerations. The
incoherency of the channel taken together with the shorter pulse times will necessitate
the use of incoherent combining techniques, incurring additional losses. Finally,
cockpit noise and speech distortion can increase the difficulty in successfully
digitizing the speech information.


Fortunately, the application of advanced signal processing techniques can minimize
the losses incurred. Adaptive signal processing and signal encoding techniques are
being applied by current RADC development programs to achieve even more jam resistance
and interference suppression than can be achieved by input data compression techniques
taken  alone.  Noise  suppression  and  speech  analysis  efforts  are  showing  great  promise 
in  solving  the  practical  cockpit  speech  input  problem.  Additionally,  basic  research  is 
being  conducted  to  make  practical  the  use  of  word  recognition  techniques  for 
applications  where  the  speech  processing  delay  does  not  pose  an  unacceptable  factor  in 
the  communications  system  design. 

The RADC in-house program (HF Terminal with ECCM Modem, Speech
Recognition/Synthesis) demonstrated a combination of techniques which provide anti-jam
(AJ) voice communication over radio channels whose bandwidth ordinarily supports only
conventional non-AJ voice (Beek, B., 1982). Moreover, this combination also provides
enhanced reliability under noisy (but unjammed) channel conditions. The voice source
encoding employs special codes to represent phrases and, in some cases, sentences, and
thus provides a significant amount of data compression. This type of system
will narrow the bandwidth requirements for voice communication to approximately 80 Hz
and will provide 15.7 db anti-jam margin. This compares very favorably with analog
systems that require a bandwidth of 3000 Hz and have no anti-jam margin. Low data rate
systems  have  the  disadvantages  of  vocabulary  size  restriction,  word  rate  restrictions, 
and  loss  of  speaker  identity,  but  the  advantage  of  increased  intelligibility  may 
outweigh  the  disadvantages  for  certain  applications.  As  connected  speech  recognition 
systems  are  developed,  vocabulary  size  and  word  rate  restrictions  can  be  minimized. 
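
The quoted 15.7 db margin is consistent with the ratio of the two bandwidths. The text does not state how the figure was derived; the check below assumes that the anti-jam margin is the ratio of the 3000 Hz analog bandwidth to the 80 Hz requirement, expressed in decibels.

```latex
\[
10 \log_{10}\!\left(\frac{3000\ \mathrm{Hz}}{80\ \mathrm{Hz}}\right)
  = 10 \log_{10}(37.5) \approx 15.7\ \mathrm{dB}
\]
```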

5.0 Automatic Speaker Verification/Identification

Speaker  Verification.  The  objective  of  this  program  is  to  develop  automated 
methods  of  identity  verification  for  the  purpose  of  providing  controlled  access  to 
secure  areas.  (See  Fig.  9)  For  many  years,  RADC  has  supported  the  development  of  a 
method  of  entry  control  using  speech  as  the  personal  attribute.  The  Automatic  Speaker 
Verification (ASV) System has proved to be highly reliable (over 99% accurate) at
verifying individuals' identity and detecting impostors.

An Advanced Development Automatic Speaker Verification System was fabricated,
tested, and evaluated for secure entry control using a person's voice as the personal
attribute. Under this effort, algorithms were implemented on three TI
900 minicomputers, which were operationally tested for six months at the entrance of
the Semiconductor building at Texas Instruments, Dallas, Texas. A total of 286 users
(200 men and 86 women) provided 13,639 accesses. A Type I error rate (true speaker
rejection) of less than 1.0% was achieved. Off-line tests on casual impostors provided
a Type II error rate (impostor acceptance) of less than 1.0% with a confidence level
greater than 90 percent.
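
To make the two error measures concrete, the following sketch computes an observed error rate and, under a simple binomial model with independent trials, the number of error-free impostor trials needed to support a claim of "less than 1.0% at 90 percent confidence". The error count used here is hypothetical; the text reports only the 13,639 accesses and the resulting bounds, and the statistical model is an assumption rather than the method used in the test.

```python
import math

def error_rate(errors: int, trials: int) -> float:
    """Observed error rate as a fraction of trials."""
    return errors / trials

def trials_for_zero_error_bound(max_rate: float, confidence: float) -> int:
    """Smallest number of error-free trials supporting 'rate < max_rate'
    at the given confidence, assuming independent trials (binomial model)."""
    # With true rate p, P(no errors in n trials) = (1 - p)**n.
    # Require (1 - max_rate)**n <= 1 - confidence and solve for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_rate))

# Hypothetical count: 100 true-speaker rejections in the 13,639 accesses.
type_i = error_rate(100, 13639)
needed = trials_for_zero_error_bound(0.01, 0.90)
print(f"Type I rate: {type_i:.2%}; error-free impostor trials needed: {needed}")
```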

A study of speaker verification using an LPC-based prediction residual was also carried out
under this effort. This study provided a magnitude of improvement in performance which
exceeds  the  goals  of  this  effort.  Future  work  in  this  area  is  to  implement  an 
LPC-based  speaker  verification  system. 

Speaker  Identification.  This  problem  is  similar  to  the  speaker  verification 
problem  except  no  prior  identity  claim  is  made  by  the  unknown  speaker.  Speaker 
identification  is  the  harder  problem  for  several  reasons  (See  Fig.  10): 

a.  The  speaker  may  be  uncooperative; 

b.  The  quality  of  the  communications  channel  may  be  poor; 

c.  There  is  no  control  over  the  spoken  text  by  the  communications  analyst; 

d.  The  unknown  speaker  may  or  may  not  be  a  member  of  an  original  set  of 
speakers;  and 

e.  The recording and/or channel conditions may be different for speech
collected for reference and test samples.

An  exploratory  development  program  to  do  speaker  identification  was  recently 
concluded  (See  Fig.  11).  The  goals  of  the  effort  were  to  recognize  any  one  of  30 
unknown  male  talkers,  using  as  little  as  ten  seconds  of  reference  and  test  speech  data, 
in  real-time  as  shown  in  Fig.  12.  All  goals  of  the  effort  were  met  or  exceeded. 
These  encouraging  results  were  achieved  by  use  of  an  algorithm  originally  developed  by 
Markel, which uses ten Linear Predictive Coding (LPC) coefficients that are averaged
over the entire recognition period. A follow-on effort is planned which will attempt
to improve human factors aspects of the speaker identification system and to improve
recognition accuracy under noisy (10 dB or less SNR) channel conditions.
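
The idea of averaging ten LPC coefficients over the whole sample can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm used in the program: the frame length, the Euclidean distance measure, the function names, and the synthetic "talkers" (white noise passed through a two-pole filter) are all hypothetical stand-ins.

```python
import math
import random

def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def lpc(frame, order):
    """LPC predictor coefficients via the Levinson-Durbin recursion."""
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[1..order] predict x[n] from x[n-1..n-order]
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent frame: leave remaining coefficients at zero
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]

def average_lpc(signal, order=10, frame_len=200):
    """Average the LPC vectors of consecutive non-overlapping frames."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    total = [0.0] * order
    for frame in frames:
        total = [t + c for t, c in zip(total, lpc(frame, order))]
    return [t / len(frames) for t in total]

def identify(test_signal, references, order=10):
    """Return the reference name whose averaged LPC vector is closest."""
    test_vec = average_lpc(test_signal, order)
    return min(references, key=lambda name: math.dist(test_vec, references[name]))

def synthetic_talker(a1, a2, n, seed):
    """Hypothetical stand-in for speech: white noise through a 2-pole filter."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n):
        x.append(a1 * x[-1] + a2 * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]

references = {
    "talker_A": average_lpc(synthetic_talker(1.3, -0.6, 4000, seed=1)),
    "talker_B": average_lpc(synthetic_talker(-0.9, -0.5, 4000, seed=2)),
}
unknown = synthetic_talker(1.3, -0.6, 4000, seed=3)
print(identify(unknown, references))
```

Averaging over many frames smooths out what is being said and leaves a vector that depends mainly on the talker, which is why no control over the spoken text is needed in this sketch.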

6.  Speech  Enhancement 


The  use  of  Automatic  Speech  Recognition  (ASR)  to  relieve  flight  crew  workload  and 
to  provide  narrowband  communications  for  airborne  operations  is  highly  desirable. 
Unfortunately no ASR system exists that can cope with the harsh, noisy airborne
environment. Current commercial ASR equipment has not been designed to operate in the
airborne environment. For this reason a considerable amount of attention has been
given to reducing the effects of the airborne environment on ASR.

There  are  many  environmental  effects  that  cause  poor  operation  of  an  ASR  system  in 
the  aircraft  environment.  Some  of  these  effects  are  aircraft  noise,  breathing  noise, 
operator  stress,  operator  fatigue,  effects  of  gravitational  forces  on  operator's 
speech,  etc.  Although  all  of  these  environmental  effects  must  be  reduced,  much 
attention  has  been  given  to  reducing  the  acoustic  noise  generated  by  the  aircraft.  The 
level and characteristics of this noise can vary considerably, depending on such
conditions as type of aircraft, location of the ASR microphone, facemask or no mask
operations,  and  status  of  aircraft. 

The  areas  of  concentration  in  reducing  the  effects  of  this  noise  have  been  in  the 
development  of  more  robust  recognition  algorithms  and  the  development  of  techniques  to 
reduce  the  acoustic  noise  before  recognition  processing  begins.  One  area  which  has 
generated  some  interest  for  removing  aircraft  noise  has  been  the  area  of  speech 
enhancement.  Some  of  the  problems  with  these  techniques  have  been  high  spectral 
distortion,  limited  noise  adaptation,  and  distortion  characteristics  that  vary  with 
input  signal  noise  level  and  spectral  shape. 

Rome Air Development Center (RADC) has been developing speech enhancement
technology to improve the quality and intelligibility of speech signals that are masked
and interfered with by communication channel noise. RADC's interest in speech
enhancement  is  not  only  in  improving  the  quality  and  intelligibility  of  speech  signals 
for  human  listening  and  understanding  but  to  improve  speech  signals  for  machine 
processing  as  well.  Speech  technology  such  as  speaker  identification,  language 
recognition  and  keyword  recognition  being  developed  by  RADC  requires  good  quality 
signals  in  order  to  provide  effective  results.  The  development  of  automatic,  real-time 
speech  enhancement  technology  is  therefore  of  high  interest  to  RADC.  This  technology 
is  required  to  improve  the  quality  of  degraded  speech  signals  to  an  acceptable  level 
for  these  systems. 

Exploratory development work at RADC has led to the development of an Advanced
Developmental Model enhancer called the Speech Enhancement Unit (SEU) (See Fig. 13).
This unit, which uses a high speed digital array processor in conjunction with time,
frequency and root-cepstral algorithms, provides an on-line, real-time capability to
remove frequently encountered communication channel interferences with minimum
degradation to the speech signals. The types of interferences or noises removed can be
classed into three groups: (1) impulse noises such as static and ignition noise, (2)
narrowband noise which includes all tone-like noises, and (3) wideband random noise
such as atmospheric and receiver electronic noises. Tests have shown that the SEU can
reduce all of these types of noises simultaneously while improving both the quality and
the intelligibility of the speech signal. The capability to remove both narrowband and
wideband random noise without degrading the quality of the speech signal may make these
speech  enhancement  techniques  applicable  to  improving  the  performance  of  Automatic 
Speech  Recognition  (ASR)  in  the  airborne  environment.  The  SEU's  ability  to  remove 
narrowband  types  of  noises  automatically  and  in  real-time  by  as  much  as  forty  (40) 
decibels  would  allow  the  removal  of  such  aircraft  noises  as  power  converter  hums, 
periodic  aircraft  vibrational  noises,  aircraft  compressor  noises,  and  other  rotational 
noises  associated  with  the  engine.  Since  the  noise  removal  process  causes  little 
distortion  to  the  speech  signal  and  removes  a  minimum  amount  of  the  speech  signal,  this 
spectral  noise  removal  process  should  remove  all  narrowband  noises  without  having 
detrimental  effects  on  the  recognition  accuracy  of  the  ASR  system. 

The  SEU's  ability  to  remove  wideband  random  noise  automatically  and  in  real-time 
may allow the removal of much of the nonstationary noise generated by the aircraft. An
example of the noise removal process is shown in Fig. 14. The wideband noise removal
process is a root-cepstral process that can improve the signal-to-noise ratio of noisy
communication channels by as much as 12 to 14 decibels. An improvement of this amount in
the  signal  received  at  the  input  of  an  ASR  system  could  improve  the  performance  of  an 
ASR  system  vastly. 

The  wideband  noise  removal  is  a  subtractive  process  that  is  accomplished  in  the 
spectrum  of  the  square  root  of  the  amplitude  spectrum.  While  this  function  is  not  the 
same  as  the  cepstrum  (the  cepstrum  is  the  spectrum  of  the  log  amplitude  spectrum),  it 
resembles the cepstrum and is referred to as the root-cepstrum. In this method of
noise reduction the average root-cepstrum of the noise in the input signal is updated
continually  and  subtracted  from  the  root-cepstrum  of  the  combined  speech  and  noise. 
Because  the  random  noise  concentrates  disproportionately  more  power  in  the  low  region 
of  the  root-cepstrum  than  does  the  speech,  the  subtracted  reconstructed  time  signal 
produces  an  enhanced  speech  signal. 
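
The subtraction described above can be sketched for a single frame as follows. This is a simplified illustration, not the SEU implementation: the naive transform, the running-average update rate, the reuse of the noisy phase for reconstruction, the clamping of negative values, and all function names are assumptions.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def root_cepstrum(frame):
    """Return (root-cepstrum, phase factors) of one frame.

    Root-cepstrum = spectrum of the square root of the amplitude spectrum.
    """
    spectrum = dft(frame)
    amplitude = [abs(v) for v in spectrum]
    phase = [v / a if a > 0.0 else 1.0 for v, a in zip(spectrum, amplitude)]
    return dft([math.sqrt(a) for a in amplitude]), phase

def update_noise_estimate(estimate, noise_frame, rate=0.1):
    """Running average of the noise root-cepstrum (update rate is an assumption)."""
    current, _ = root_cepstrum(noise_frame)
    if estimate is None:
        return current
    return [(1.0 - rate) * e + rate * c for e, c in zip(estimate, current)]

def enhance(frame, noise_estimate):
    """Subtract the noise root-cepstrum and rebuild the time signal."""
    cep, phase = root_cepstrum(frame)
    diff = [c - n for c, n in zip(cep, noise_estimate)]
    # Back to the root-amplitude spectrum; negative values are clamped to zero.
    root_amp = [max(v.real, 0.0) for v in dft(diff, inverse=True)]
    spectrum = [(r * r) * p for r, p in zip(root_amp, phase)]
    return [v.real for v in dft(spectrum, inverse=True)]

# Hypothetical 32-sample noise frame (deterministic, for illustration only).
noise = [0.3 * math.sin(0.9 * t) + ((t * 37) % 11 - 5) * 0.02 for t in range(32)]
estimate = update_noise_estimate(None, noise)
silenced = enhance(noise, estimate)    # a frame equal to the estimate is cancelled
unchanged = enhance(noise, [0j] * 32)  # a zero estimate leaves the frame intact
```

Because the transform is linear, subtracting every coefficient in this way is equivalent to subtracting the averaged root-amplitude spectrum directly. A practical unit would presumably weight or limit the subtraction to the low region where the noise dominates, a detail the text does not give.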

There  are  two  reasons  why  this  technique  of  wideband  noise  removal  is  encouraging 
for  the  successful  removal  of  aircraft  noise  for  ASR.  First  the  noise  removal 
technique  used  is  independent  of  the  spectral  shape  of  the  noise.  This  indicates  that 
the  enhancement  unit  should  theoretically  adjust  to  the  aircraft  noise.  The  second 
encouraging  reason  is  that  the  enhancement  transformation  used,  unlike  the  spectral 
subtraction  methods  which  can  cause  high  distortion,  causes  very  little  distortion  to 
the  speech  signal  which  is  important  to  the  recognition  accuracy  of  any  ASR  equipment. 


The  SEU's  capability  to  reduce  narrowband  and  wideband  noise  without  causing 
distortion  that  is  detrimental  to  the  human  listener  (see  Fig.  15)  may  be  used  to 
improve  the  recognition  accuracy  of  ASR  equipment  in  a  noisy  airborne  environment.  For 
this  reason  RADC  is  planning  a  series  of  carefully  controlled  tests.  The  tests  will 
utilize  two  speech  recognizers  in  conjunction  with  the  SEU.  The  effects  of  various 
types of noise on the recognition accuracy of these ASR systems will be determined
with and without the enhancer. Preliminary results for an LPC-based Recognition System
are shown in Fig. 16.

7.  Voice  Control  &  Data  Entry  Systems 

A  Voice  Data  Entry  (VDE)  system  was  designed  for  use  in  entering  voice  cartographic 
data to the Digital Landmass System (DLMS) data base. The first Voice Data Entry
system  was  installed  at  the  Defense  Mapping  Agency  Hydrographic  Center  (DMAHC)  (See 
Fig.  17).  This  allowed  the  user  to  enter  depth  information  found  on  the  map  into  a 
computer as shown in Fig. 18. This information was stored along with the map
coordinates  of  the  particular  depth  readings.  The  vocabulary  for  this  study  included 
the  digits  plus  a  few  control  words.  Results  from  this  effort  showed,  for  a  limited 
vocabulary  scenario  where  the  operator  had  been  sufficiently  trained  in  system 
operation,  that  voice  data  entry  was  faster  than  a  manual  method  of  keyboard  entry  for 
both a skilled and unskilled operator. This effort also revealed that an in-depth study of
error correction procedures, methods of system training, and operator familiarization
procedures  would  be  required  in  order  to  increase  the  efficiency  of  future  Voice  Data 
Entry  Systems. 

The  second  effort  was  the  design  and  testing  of  a  Voice  Data  Entry  (VDE)  system 
which  would  serve  to  input  cartographic  data  to  a  computer.  The  system  was  installed 
at  the  Defense  Mapping  Agency  Aerospace  Center  (DMAAC),  for  test  and  evaluation.  The 
VDE  system  is  intended  for  use  in  entering,  by  voice,  cartographic  data  to  the  Digital 
Landmass  System  (DLMS)  Data  Base.  The  VDE  system  developed  had  the  capability  of 
recognizing  up  to  248  separate  words  in  syntactic  structures. 
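
One way a syntactic structure can help a recognizer is by limiting which words are candidates at each point in an entry. The sketch below illustrates this with a hypothetical two-state syntax and made-up match scores; it is not the DMAAC system's grammar or scoring method.

```python
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

# Hypothetical syntax: each state lists the words allowed next
# and the state each word leads to.
SYNTAX = {
    "start": {"depth": "digits", "erase": "start"},
    "digits": {**{d: "digits" for d in DIGITS}, "enter": "start"},
}

def recognize(scores_per_utterance, syntax, state="start"):
    """Pick, for each utterance, the best-scoring word that the syntax allows."""
    words = []
    for scores in scores_per_utterance:
        allowed = syntax[state]
        best = max(allowed, key=lambda w: scores.get(w, float("-inf")))
        words.append(best)
        state = allowed[best]
    return words

# Made-up acoustic match scores (higher is better) for four utterances.
utterances = [
    {"depth": 0.80, "eight": 0.85},  # "eight" scores higher but is not allowed here
    {"four": 0.90, "enter": 0.20},
    {"two": 0.70, "depth": 0.75},    # "depth" is not allowed in the digit state
    {"enter": 0.95},
]
print(recognize(utterances, SYNTAX))
```

In this example the syntax resolves two confusions that the scores alone would get wrong, which suggests why a large vocabulary becomes more manageable when only a subset is active at any time.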

The  two  systems  described  are  isolated  utterance  speaker  dependent  systems.  For 
inputting  a  string  of  words,  this  requires  a  distinct  pause  between  each  word.  Tests 
have  shown  that  isolated  word  systems  are  three  times  slower,  and  more  frustrating  than 
normal  voice  data  entry.  This  increases  errors  and  further  decreases  the  data  entry 
speed.  However,  in  many  applications  the  emphasis  is  to  input  connected  digits  and 
isolated  words  or  phrases.  In  these  applications  many  of  the  functions/commands  have 
been  reduced  to  a  set  of  digit  codes  well  understood  by  the  analysts. 

Presently  RADC  is  developing  an  Advanced  Development  Model  (ADM)  Voice  Data  Entry 
System  to  satisfy  DMA's  operational  requirements  for  automated  compilation  of  the 
Feature Analysis Data Table (FADT) for DLMS operation. This system will incorporate a
limited  vocabulary  which  may  be  entered  in  connected  or  normal  speech,  and  an  extended 
vocabulary  which  will  be  entered  in  an  isolated  speech  mode. 

RADC  is  also  investigating  voice  interactive  I/O  algorithms  to  input  a  limited 
vocabulary  spoken  in  continuous  text  into  a  computer  with  a  voice  synthesis  feedback 
capability. The algorithms shall be capable of recognizing a 300-word,
syntax-independent vocabulary. The recognition shall be done in real time using a pretrained
reference library.

Automatic  Speech  Data  Entry  Systems  have  application  to  many  Air  Force  command, 
control  and  communication  problems.  However,  the  cost,  size,  weight,  and  power 
consumption  of  these  devices  must  be  reduced  for  many  applications.  RADC  is  currently 
looking  at  Very  Large  Scale  Integration  (VLSI)  technology  and  microprocessor  technology 
as  a  means  of  reducing  cost,  size,  weight,  and  power  consumption  of  VDE  devices  (See 
Fig.  19). 

8.  DOD and NATO Advisory Groups on Voice Technology

At  the  present  time,  two  major  military  automatic  speech  recognition  and  technology 
groups  are  pursuing  active  technical  coordination,  data  exchange  and  cooperative 
research projects. The first is the DOD-approved Voice Technology for Systems
Applications Sub-technical Advisory Group (VSTAG). The purpose of this VSTAG is to
provide a forum for technical interaction between scientists and engineers at the bench
level. Included as representatives to the VSTAG are members of the Air Force, Army,
Navy, NASA, FAA, Post Office and NSA research laboratories that are engaged in speech
processing  applications.  Table  3  lists  the  members  of  VSTAG. 

The second is the NATO AC/243 Panel III Research Study Group (RSG)-10 for Speech
Processing.  The  first  meeting  of  RSG-10  was  held  in  Paris,  France  in  May  1978. 
Meetings  are  held  twice  a  year  and  are  rotated  among  the  member  nations.  The  technical 
objectives  of  RSG-10  are  generally  to  review  speech  processing  topics  of  military 
relevance  in  order  to  recommend  specific  research  projects  to  be  carried  out 
cooperatively  among  the  member  nations.  Member  nations  include  Canada,  France, 
Germany, Netherlands, United Kingdom and the United States. Table 4 is a list of
RSG-10 participants.


TABLE 3

Army

ARI      Army Research Institute
ETL      Engineering Topographic Laboratory
AVRADA   Avionics Research and Development Activity
         Human Engineering Lab
         Communicative Technology Office

Navy

NAMRL    Naval Aerospace Medical Research Lab.
NADC     Naval Air Development Center
ONR      Office of Naval Research
NOSC     Naval Ocean Systems Center
NPGS     Naval Post Graduate School
NATC     Naval Air Test Center
NNMC     National Naval Medical Center
NWC      Naval Weapons Center
NASC     Naval Air Systems Command
NTEC     Naval Training Equipment Center
NPRDC    Navy Personnel R&D Center

Air Force

AFAMRL   Aero Medical Research Lab.
RADC     Rome Air Development Center
AFWAL    Air Force Wright Aeronautic Lab
AFIT     Air Force Institute of Technology

Other Government Agencies

IRS      Internal Revenue Service
USDA     Dept. of Agriculture
NBS      National Bureau of Standards
NSA      National Security Agency
OUSDRE   Office of the Under Secretary of Defense for Research and Engineering
NASA     Ames Research Labs
US       Public Health Service
FAA      Federal Aviation Administration


TABLE 4

Mr John S. Bridle         UK
Dr M. Martin Taylor       Canada
Dr Harmut Mutschler       FR Germany
Mr Patrice DesVergnes     France
Dr Harman J. Steeneken    Netherlands
Mr Richard S. Vonusa      USA
Dr Helmut Mangold         FR Germany
Dr Joseph J. Mariani      France
Dr Melvyn J. Hunt         Canada
Dr Roger K. Moore         UK
Dr Robert Breaux          USA
Dr David Pallett          USA

9.  Future  Direction 

Since  its  inception,  research  in  automatic  speech  recognition  (ASR)  has  progressed 
to  the  point  where  military  application  can  be  a  reality.  Man's  most  natural  means  of 
communication  will  be  the  future  method  of  interaction  with  man's  machine.  Progress 
has been slow but steady and excellent success has been demonstrated on isolated word
recognition devices and speech synthesis devices to make them practicable for military
use. This has increased the interaction among scientists of various disciplines,
including interchanges in acoustic-phonetics, linguistics, signal
processing,  etc.  In  fact,  as  we  have  seen,  international  participation  in  the  solution 
of  numerous  ASR  problems  is  at  hand. 

However,  although  we  have  come  a  long  way  we  still  have  a  long  way  to  go. 
Presently,  we  are  too  strongly  focused  on  applications  to  extend  the  minimal  support 
given to a number of fundamental issues. In fact, before ASR can even approach human
performance, we still need significant advances in acoustic-phonetics relationships and
English phonology.


AUTOMATIC SPEECH RECOGNITION (ASR)
SPEECH SYNTHESIS/VOICE RESPONSE
SPEAKER IDENTIFICATION/VERIFICATION
LANGUAGE IDENTIFICATION
SPEECH ENHANCEMENT
VOCODER
SPEECH DETECTION
SPEECH SPEED RATE CHANGE

Fig. 1  Voice processing technology terminology


Fig. 2  Diagram of a general voice-interactive system


Fig. 3  Voice-interactive system in cockpit setting


ENGINEERING

1.  CAN BE FASTER THAN OTHER MODES OF COMMUNICATIONS
2.  CAN BE MORE ACCURATE THAN OTHER COMMUNICATION MODES
3.  COMPATIBLE WITH EXISTING COMMUNICATION SYSTEMS, E.G. TELEPHONES
4.  CAN BE MORE ACCURATE AT TASKS CURRENTLY PERFORMED BY HUMANS, E.G. AUTOMATIC
    SPEAKER VERIFICATION vs IDENTITY VERIFICATION BY HUMAN VISUAL INSPECTION
5.  CAN REDUCE MANPOWER REQUIREMENTS
6.  CAN BE MOST COST-EFFECTIVE MAN-MACHINE INTERFACE

PSYCHOLOGICAL

1.  MOST NATURAL FORM OF HUMAN COMMUNICATION
2.  BEST FOR GROUP OR TEAM PROBLEM SOLVING
3.  UNIVERSAL (OR NEARLY SO) AMONG HUMANS & REQUIRES NO TRAINING
4.  CAN CONTAIN VALUABLE INFORMATION REGARDING EMOTIONAL STATE OF SPEAKER
5.  CAN REDUCE VISUAL & MOTION INFORMATION OVERLOAD
6.  CAN REDUCE VISUAL & MOTOR WORKLOAD
7.  INCREASES IN VALUE PROPORTIONAL TO COMPLEXITY OF INFORMATION BEING PROCESSED
8.  CAN REDUCE ERRORS FOR TASKS INVOLVING CONSIDERABLE COGNITIVE (AS OPPOSED TO
    PERCEPTUAL) EFFORT

PHYSIOLOGICAL

1.  REQUIRES LESS EFFORT & MOTOR ACTIVITY THAN OTHER COMMUNICATION MODES
2.  FREES EYES & HANDS & DOES NOT REQUIRE PHYSICAL CONTACT WITH TRANSDUCER
3.  PERMITS MULTI-MODAL OPERATION
4.  POSSIBLE EVEN IN DARKENED ENVIRONMENTS
5.  OMNI-DIRECTIONAL & DOES NOT REQUIRE DIRECT LINE OF SIGHT
6.  PERMITS CONSIDERABLE OPERATOR MOBILITY
7.  CONTAINS INFORMATION ABOUT IDENTITY OF COMMUNICATOR
8.  CONTAINS INFORMATION REGARDING PHYSICAL STATE OF THE COMMUNICATOR
9.  SIMULTANEOUS COMMUNICATIONS WITH HUMANS & MACHINES

Fig. 4  Advantages of speech communications


1.  COMPETING ACOUSTIC SOURCES MAY INTERFERE WITH SPEECH. THESE INCLUDE NOISE,
    DISTORTION, & OTHER TALKERS
2.  VARIETY OF PHYSICAL CONDITIONS CAN CHANGE ACOUSTIC CHARACTERISTICS OF SPEECH,
    INCLUDING VIBRATION, G-FORCES, & PHYSICAL ORIENTATION OF SPEAKER
3.  HUMAN FATIGUE CAN RESULT FROM PROLONGED SPEAKING & FATIGUE MAY CHANGE SPEECH
    CHARACTERISTICS
4.  PHYSICAL AILMENTS SUCH AS COLDS MAY CHANGE SPEECH CHARACTERISTICS
5.  SPEECH IS NOT PRIVATE & MAY BE OBSERVED BY OTHERS
6.  NO PERMANENT RECORD OF SPEECH UNLESS RECORDED EXPLICITLY (NOT TRUE OF TYPING)
7.  PSYCHOLOGICAL CHANGES (STRESS FOR EXAMPLE) IN SPEAKER MAY CHANGE HIS SPEECH
    CHARACTERISTICS
8.  MICROPHONES REQUIRED FOR SPEECH INPUT, ACOUSTIC SPEAKERS FOR SPEECH OUTPUT
9.  SPEECH SYNTHESIS MAY INTERFERE WITH OTHER AURAL INDICATORS
10. SPEECH SYNTHESIS MORE SERIAL INFORMATION CHANNEL THAN VISUAL DISPLAYS & CAN BE
    SLOWER

Fig. 5  Disadvantages of speech communications


•  LIMITED CHANNEL CAPACITY
•  BETTER NOISE IMMUNITY
•  INCREASED JAM-RESISTANCE
•  COST ADVANTAGES

Fig. 6  Bandwidth reduction needed because:


UNDER 200 bps SYSTEMS

•  SPEAKER DEPENDENT
•  LIMITED VOCABULARY

200-400 bps SYSTEMS

•  SPEAKER INDEPENDENT
•  UNLIMITED VOCABULARY

Fig. 7  State of the art


Fig. 9  BISS in-house test facility


GENERAL CONSTRAINTS

TEXT-INDEPENDENCE
UNCOOPERATIVE SPEAKERS
BAND-LIMITED & NOISY COMMUNICATIONS CHANNELS
CHANNEL CONDITIONS MAY VARY FOR SPEECH COLLECTED FOR REFERENCE & TEST SAMPLES

OPERATIONAL CONSTRAINTS

MUST OPERATE ON-LINE & IN REAL-TIME
SPEECH SEGMENTS AVAILABLE FOR REFERENCES & UNKNOWNS MAY BE VERY SHORT
MUST WORK RELIABLY FOR SEVERAL LANGUAGES

Fig. 10  Speaker authentication problem


FEATURES: 

•  HUMAN  HAS  OPPORTUNITY  TO  OVERRIDE  MACHINE'S  DECISION 

•  HUMAN  MAY  UPDATE  FILES  WHEN  HE  DESIRES 

•  HUMAN  MAY  LISTEN  TO  MOST  RECENT  SPEECH  DATA  FOR  ANY  SPEAKER 

•  OPERATION  IS  REAL-TIME,  ON-LINE,  CONTINUOUSLY 


Fig. 11  Speaker identification


[Plot: recognition performance versus length of unknown sample (seconds; axis marks at 3, 5 and 10).
Curves: 1982 short utterance algorithm; 1982 results (reflection coefficients, cepstral
coefficients, spectral slope); 1980 results (reflection coefficients).]

Fig. 12  Automatic speaker recognition performance


Fig. 14  Wideband noise removal process


Fig. 15  Speech enhancement test results


Fig. 16  SEU/LPC based recognition performance


Summary

This talk is intended to provide an introduction to the speech signal
with particular emphasis on the recognition of spoken messages.
In an attempt to clarify its nature, speech is compared with
two other kinds of signal. Some properties of words and phonemes
are considered, and it is concluded that, unlike many artificial
message-bearing signals, speech cannot be considered as a simple
sequence of independent message units. A speaker adjusts the
amount of information in his speech to suit his listener, and the
listener carries out an active reconstruction of the message from
the information available to him. Turning to recognition by
machine, the use of syntactic constraints is first discussed, followed
by a look at three kinds of approach to the analysis and
representation of speech for recognition purposes. A brief account
of speech production is provided in order to explain the motivation
for production-based representations. This is followed by a look at
how knowledge of auditory perception has been incorporated into
recognition systems. Finally, some purely pragmatic approaches
are discussed, and it is argued that success here generally correlates
with simplicity.


Introduction 

This session is concerned with the nature of the speech signal itself: the signal that
allows one human being to communicate to another whatever message he consciously
chooses to express, with no external aids and usually with very little effort. As such, it
is the session furthest removed from applications of speech technology. I am assuming
that you, the audience or the readers of the proceedings, are mostly not speech
specialists but rather people interested in how speech technology can be used. I am
therefore not going to try to give you a comprehensive account of speech production
or phonetics or linguistics. Instead, I want to put to you a few general ideas about the
speech signal. My hope is that these ideas may provide a clearer picture of what people
trying to make speech recognizers are up against, which recognition tasks are
difficult and which relatively easy.

Before  I  start,  I  should  mention  a  problem  often  faced  by  speech  researchers  in 
describing  their  work:  if  this  lecture  series  were  about  some  newly  developed  or  newly 
discovered  signal  we  could  address  an  audience  free  of  preconceptions,  ready  to 
accept  whatever  we  had  to  tell  them.  But  everyone  can  speak,  and  so  everyone  already 
has  some  strong  subjective  ideas  about  the  speech  signal.  What  is  worse,  most  people 
can  read,  and  their  knowledge  of  the  written  representation  generally  has  a  strong 
effect  on  how  they  think  of  the  spoken  signal.  I  will  come  back  to  this  point  later.  For 
the  moment,  perhaps  you  might  try  to  forget  that  you  can  speak  or  read. 

What  sort  of  a  signal  is  speech? 

In  trying  to  answer  this  question,  I  think  it  is  helpful  to  start  off  by  considering  two 
other types of signal that speech is sometimes grouped with. The first is a class of
signals that are subjected to image processing. To be specific, let us choose a satellite
image of a portion of the earth. Such an image has the obvious difference that it is
two dimensional while the speech signal is effectively one dimensional. The more
important difference, though, is that the satellite image is not a communication: it
contains information but it does not contain a message. The very same image might be
used to study the vegetation of an area or to try to spot missile silos, but presumably
the image processing techniques appropriate for the one task would be quite different
from those appropriate for the other. Thus, image processing tends to be a loose
collection of techniques with diverse goals. Depending on which field we want to flatter,
we can describe the automatic speech recognition problem as more limited or as
more coherent than image processing.


The  image  processing  problem  we  just  discussed  is  rather  like  the  problem  in  the 
speech  field  of  determining  the  identity  or  the  emotional  state  of  a  speaker  from  a 
speech sample, since we the receivers are deciding what information we want to derive
from  the  signal  rather  than  trying  to  extract  the  message  being  intentionally  supplied 
by  the  speaker.  The  rest  of  this  session,  however,  and  indeed  most  of  this  whole 
series,  is  concerned  with  the  problem  of  recognizing  or  efficiently  transmitting  the 
intended  message,  not  the  side  information  that  may  come  with  it. 

The  discussion  that  follows  also  excludes  certain  kinds  of  social  communication  such 
as  "Hello,  how  are  you?",  where  the  speaker  is  not  so  much  enquiring  into  the  state  of 
health of the listener as making a semi-voluntary announcement of his feelings and
relationship  to  the  listener.  This  use  of  speech  seems  similar  to  the  way  in  which  a 
dog  might  bark  a  greeting  at  its  master  or  a  threat  at  an  intruder.  It  is  not  what 
makes  human  speech  special,  and  it  is  not  of  primary  interest  in  communicating  with 
machines. 

The second kind of signal I would like to have you consider is a man-made artificial
communications signal. We could take as a specific example another optically derived
signal like the output of a scanner reading product codes in a supermarket, but I
think a better one is provided by an h.f. radio transmission carrying teleprinter text.
In this example, there is quite clearly a message, and the message is laid out
sequentially in time or space just like speech. The similarities to speech are obvious; the
differences much less so, but they are nonetheless large and I want to take some time
to look at them.

The  artificial  signals  in  our  examples  are  composed  of  a  sequence  of  units,  the  units 
being selected from a definite, known set that I want to call an alphabet. The units in
a  message  are  generally  well  separated  from  each  other,  and  they  do  not  interact. 
The  decoding  device  usually  has  available  to  it  in  some  form  an  ideal,  undistorted 
representation  of  the  alphabet,  and  decoding  consists  mainly  of  trying  to  identify  the 
received  units  one  by  one  using  its  built-in  knowledge  of  the  ideal  forms. 
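
The kind of decoder described here can be sketched in a few lines. The alphabet, waveforms, unit length and noise level below are hypothetical; the point is only that, with well-separated units and known ideal forms, decoding reduces to cutting the signal at fixed positions and choosing the nearest ideal form for each piece.

```python
import random

# Hypothetical alphabet: each unit has an ideal, undistorted waveform.
ALPHABET = {
    "A": [1.0, 1.0, -1.0, -1.0],
    "B": [1.0, -1.0, 1.0, -1.0],
    "C": [-1.0, -1.0, 1.0, 1.0],
}
UNIT_LEN = 4

def transmit(message, noise_sd, seed=0):
    """Lay the units out one after another and add channel noise."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, noise_sd) for unit in message for s in ALPHABET[unit]]

def decode(signal):
    """Cut the signal into fixed-length units and match each to the nearest ideal form."""
    message = []
    for start in range(0, len(signal), UNIT_LEN):
        received = signal[start:start + UNIT_LEN]

        def distance(unit):
            return sum((r - s) ** 2 for r, s in zip(received, ALPHABET[unit]))

        message.append(min(ALPHABET, key=distance))
    return "".join(message)

print(decode(transmit("ABCCBA", noise_sd=0.2)))  # prints "ABCCBA"
```

The decoder depends entirely on two properties of the artificial signal: the unit boundaries are known in advance, and each received unit can be compared with a single ideal reference.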

Words 

What  is  the  equivalent  of  these  units  for  the  speech  signal?  I  contend  that  there  is  no 
single  exact  equivalent.  Perhaps  the  closest  candidate  is  the  word,  but  words  differ  in 
several major respects from our artificial units.

First of all - notwithstanding our prejudices from the written form of language - spoken
words do not in general have gaps between them. Indeed, there are no consistent
acoustic cues of any kind to word boundaries. What is more, not only are words not
well separated from each other, they often interact at their boundaries. For instance,
"bread board" is often pronounced in fluent English in a way that we might write as
"breab board", and "this shop" as "thish shop". There is a more extreme example in
French in the phenomenon of liaison: "ils ouvrent" (they open) sounds different from
"il ouvre" (he opens) because we hear the "s" of ils in the first case. But it takes the
initial vowel in ouvrent to bring the "s" to life: the corresponding expressions for
closing - "ils ferment" and "il ferme" - do not have any distinction in their pronunciation.
Next,  we  know  of  no  ideal  reference  forms  of  words:  any  normally  pronounced  version 
of  a  word  is  as  good  as  any  other,  and  no  two  productions  will  ever  be  exactly  the 
same.  In  particular,  words  differ  in  their  prosodic  features  (intonation,  timing  and 
loudness)  depending  on  their  function  in  a  sentence.  Even  in  such  a  prosaic  utterance 
as a list of digits, the final digit differs markedly from the others, being typically 60%
longer and having a falling intonation. When people try to generate synthetic sentences
by recording words in isolation and playing them back unmodified in a
sequence,  the  result  is  disastrous  -  each  word  is  perfectly  clear,  but  the  sentence  is 
almost  impossible  to  follow. 

Despite these problems with words, most of the more successful and practical connected
speech recognizers have been word-based. As will be explained elsewhere, ways
have been found to ignore prosodic differences and concentrate on the phonetic identity
of words.

Phonemes

When I suggested words as the best equivalent of the artificial communication units, I
imagine some of you were surprised that I did not choose phonemes. Such surprise
would be understandable considering the number of popular articles on speech technology
that talk about speech being made up of phonemes as though it were like laying
out bricks in a line - just like the symbols in our teleprinter transmissions, in fact.
Proponents of phonemes might also point out that the phoneme inventory (just over
forty in English) is much more manageable - more alphabet-sized - than the enormous
inventory of words in a language. Some people might also be influenced by the way
words are printed as a string of discrete, context-independent letters. Despite all this,
I want to suggest to you that phonemes bear very little resemblance to teleprinter
symbols. If you would like a writing analogue for phoneme sequences, quite a good one
is provided by hastily scribbled handwriting, in which individual letters are hard to
isolate and depend for their form very much on the other letters around them.
A  phoneme  is  defined  as  the  smallest  unit  of  speech  within  a  word  that  when  changed 
results  in  a  change  in  the  meaning  of  the  word.  Thus,  the  English  word  tap  differs 
from  the  English  word  cap  in  the  position  of  the  tongue  at  the  start  of  the  two  words. 
In  tap  the  point  of  contact  between  the  tongue  and  the  roof  of  the  mouth  is  just 
behind  the  upper  teeth,  while  in  cap  it  is  at  a  point  quite  far  back  in  the  mouth.  We 
can  conclude  that  cap  and  tap  must  start  with  a  different  phoneme.  We  could  have 
started  with  the  tongue  making  contact  in  other  places:  it  could  have  been  directly 
behind the upper teeth like the "t" sound in eighth, or the tip of the tongue could have
been curled back slightly like the "t" in tree. If we used either of these "t" sounds in
our word tap we would not get a new word, we would simply have tap with a slightly
non-standard pronunciation - we might not even notice that the word sounded odd if it
occurred in fluent speech. Yet those same "t" sounds represent different phonemes
for some other languages. For speakers of such languages (which include several
languages spoken on the Indian subcontinent) the "t" variants presumably sound quite
distinct. In the same way, the English "l" and "r" sounds in words like lap and rap,
which sound quite different to English speakers, do not correspond to different
phonemes in Japanese, so Japanese speakers have difficulty in making the distinction.
Phonemes,  then,  are  not  "speech  sounds"  in  some  absolute  sense,  they  are  a  property 
of  the  way  a  language  gets  coded  in  sound,  and  their  phonetic  realization  is  frequently 
context-dependent. Something interesting is happening in standard French right now:
the vowel sounds in the digits deux and neuf used to be different phonemes, that is to
say, there existed at least one pair of words - jeûne and jeune are usually cited - that
differed just by the fact that the first had the deux vowel in it and the second the neuf
vowel. French speakers are increasingly using a new rule that says that the deux vowel
can occur only at the end of a word and the neuf vowel only at a non-final position in a
word. Thus the jeûne/jeune distinction is lost, and the two vowels have become
context-dependent  allophones  of  the  same  phoneme.  French  has  lost  a  phoneme,  but 
it  has  not  lost  a  speech  sound. 

So far, we have established that a phoneme does not correspond to a single speech
sound, but perhaps we could say that it corresponds to a set of sounds. If by "sounds"
we  mean  something  we  can  hear  and  identify  in  isolation,  the  answer  has  to  be  no,  or 
at  least  not  always.  The  English  word  do  is  made  up  of  two  phonemes  /d/  and  /u/ 
(phonemes  are  conventionally  written  between  oblique  lines),  but  there  is  no  way  of 
pronouncing  the  /d/  without  also  pronouncing  a  vowel  either  before  or  after  it.  What  is 
more, if we take a recording of do and listen to what happens as we shorten it by successively
chopping off more and more of the vowel, we never get to hear a /d/ in isolation:
when we have shortened it enough that we no longer hear the vowel, we no longer
hear  anything  that  we  perceive  as  speech. 

The  picture  of  what  a  phoneme  might  be  in  acoustic  terms  gets  even  fuzzier  when  we 
start  to  ask  about  the  acoustic  features  a  listener  might  use  to  decide  what  phoneme 
sequence  he  is  hearing.  By  using  a  speech  synthesizer,  researchers  have  been  able  to 
vary  the  properties  of  speechlike  sounds  and  so  investigate  the  phonetic  cues  that 
listeners  use.  It  turns  out  that  listeners  often  do  not  depend  on  a  single  cue  but  rather 
weigh the evidence from several independent features. Some results have been particularly
surprising. For example, the words ones and once are normally felt to differ just
in  their  last  phoneme,  ones  ending  in  the  voiced  phoneme  /z/  and  once  in  the 
corresponding  voiceless  phoneme  /s/  (in  voiced  sounds  the  vocal  cords  act  as  a 
quasi-periodic  sound  source;  in  voiceless  sounds  they  do  not);  but  it  is  possible  to 
change  a  listener’s  judgment  of  which  word  he  is  hearing  merely  by  altering  the  length 
of  the  /n/  sound  (a  longer  /n/  indicating  ones),  and  indeed  it  seems  likely  that  this  is 
the  most  important  phonetic  cue  in  discriminating  between  these  words  in  natural 
speech.  Here  we  have  an  example,  then,  where  the  major  distinguishing  mark  of  a 
phoneme  is  not  only  not  what  we  would  expect  it  to  be,  it  is  not  even  where  we  would 
expect  to  find  it. 

Moreover, some work carried out in England [1] has shown that cues to phoneme identity
are not even entirely confined to the auditory channel: in appropriate circumstances
visual cues can be integrated into speech perception. The point has been
convincingly demonstrated by synchronizing a recording of a stop consonant-vowel
sequence - e.g. "ba" - with a video recording of a person producing a different stop
consonant followed by the same vowel - e.g. "ga". The perception of the sound is
strongly modified by the conflicting visual cues: in the ba/ga example what is perceived
is "da". The effect has perhaps to be seen to be fully believed: when I saw the
demonstration I "heard" a perfectly natural "da" whilever I watched the screen; it
reverted to "ba" as soon as I heard it while looking away from the screen.

I  hope  all  this  is  beginning  to  convince  you  that  speech  cannot  be  considered  as  a 
sequence  of  speech  sounds  in  the  way  that  the  teleprinter  transmission  is  a  sequence 
of teleprinter symbols.


Fig. 1 A visual paradox: Cube with Magic Ribbons by M.C. Escher (courtesy M.C.
Escher  Foundation,  the  Hague). 


The active nature of speech perception

Before I go on, I would like you to look at the M.C. Escher drawing reproduced in Figure
1. It seems paradoxical: among other problems that it poses, the circular objects
on the ribbon seem to flip from pointing outwards to pointing inwards. If we were able
to regard it as a meaningless pattern of different shades of grey on a flat piece of
paper,  there  would  be  no  paradox.  But  it  seems  all  but  impossible  to  restrain  our 
minds  from  attempting  to  reconstruct  a  three-dimensional  object  out  of  the  pattern, 
even  when  such  a  reconstruction  cannot  be  made  to  work.  The  picture  illustrates  the 
point that visual perception does not work simply by recording the light entering the
eye, but rather by actively trying to "make sense" of that light. To give another example,
we can perceive the color brown, but there is no such thing as brown light:
apparently our brain deduces the "brownness" of an object by comparing the quality
of  the  light  reflected  from  the  object  to  that  of  the  light  that  our  brain  computes  to  be 
striking  the  object. 

I want to suggest that our hearing is similar: it no more works like a microphone than
our vision works like a camera. In particular, listening to speech is not a passive
detection of an acoustic signal: it is an active reconstruction of the transmitted message.

This reconstruction is so effective that we frequently do not notice that apparently
important information is missing. Until we have to deal with something unfamiliar like
an unusual name, we are hardly aware that it is not possible to distinguish between an
"s" sound and an "f" sound on the telephone; and synthetic speech in which all the
voiceless sounds have been replaced by silence seems surprisingly normal, particularly
if there is some background hiss that our brain can take to be voiceless fricatives.

It  is  the  reconstruction  that  gives  us  such  a  firm  impression  that  the  speech  signal 
consists  of  a  neat  sequence  of  phonemes:  it  may  indeed  be  possible  to  describe  speech 
in  this  way,  but  only  at  a  certain  stage  of  processing  in  our  brains,  not  at  the  level  of 
the acoustic signal.

In reconstructing the speech message the listener can use information from several
different sources. We have already mentioned prosodic cues such as intonation that
generally indicate sentence structure, and phonetic cues that indicate word structure.
There are rules that govern the order in which words can be uttered in the syntax
of a language, and other constraints labeled as semantics that provide that most
sentences should be meaningful. Syntax and semantics are clearly linked, but they are
at least partially separable: Chomsky's sentence colorless green ideas sleep furiously
is syntactically acceptable but meaningless, while Me Tarzan, you Jane breaks the
rules of standard English syntax but was nevertheless meaningful to the cinema audiences
that heard it. To our list of information sources we might also add external context,
that is, whether an utterance is germane to the situation, and whether it is the
sort of remark the speaker frequently makes under his present circumstances.
Finally, the work with synchronized video recordings demonstrates that in some circumstances
optical information is used in reconstructing the speech message.
This  leads  me  to  point  out  another  way  in  which  speech  differs  from  the  teleprinter 
transmission,  namely,  the  fact  that  speech  has  to  be  regarded  as  a  multilevel 
sequence. Thus, words can be thought of as phoneme sequences, while they themselves
form part of word sequences making up phrases, which in turn make up sentences.
Evidence needed to understand speech is present at every level, and in all
probability the evidence at all levels has to be considered simultaneously if the message
is to be understood. It is true that we could find much the same set of levels in a
teleprinter  transmission  of  meaningful  text,  but  the  levels  are  not  so  intimately 
mixed:  in  order  to  decode  the  individual  teleprinter  symbols  we  do  not  even  need  to 
know  what  language  the  text  is  written  in. 

It  is  often  said  that  speech  is  a  very  redundant  signal.  As  evidence  for  this  assertion,  it 
might  be  pointed  out  that  the  same  utterance  can  be  understood  either  when  it  is 
low-pass  filtered  at  1kHz  or  when  it  is  high-pass  filtered  at  1kHz:  the  information  in 
the  lower  part  of  the  spectrum  must  somehow  be  duplicating  the  information  in  the 
upper  part.  I  believe  this  to  be  a  fallacious  way  of  looking  at  the  speech  signal.  The 
amount  of  information  one  needs  in  a  speech  signal  depends  on  how  skilled  one  is  at 
reconstructing the message: I need much higher signal quality to follow spoken French
or  German  than  I  do  to  follow  spoken  English.  To  mangle  a  metaphor:  redundancy  is  in 
the  ear  of  the  beholder. 


This brings me to what is perhaps the most important point in this talk. It is that people
do not emit speech messages to be picked up by anyone who cares to listen, they
talk to someone. Although we as yet know too little about speech to be sure about
this, it seems likely that a speaker puts just enough cues into his speech to allow his
listener (or imagined listener in the case of, say, a radio broadcast) to be able to comfortably
reconstruct the message from the evidence available. Thus, when we are
saying  something  that  is  difficult  to  follow,  or  when  we  are  speaking  to  someone  we 
believe  to  be  foreign,  deaf  or  senile,  we  supply  more  phonetic  information  than  we 
would in a relaxed conversation with a friend. Elision of phonetic information, such as
when we say fish 'n' chips, is often described as being due to laziness, but I would
argue that it is part of a rational strategy for the economical use of a communications
link: it would be lazy only if the person at the other end of the link were obliged to
make an unreasonable effort to reconstruct the message. Depending on the circumstances,
overarticulation can be just as inappropriate as underarticulation: it can
sound  stilted,  irritating,  even  insulting  when  the  listener  feels  it  to  be  unnecessary. 

To  summarize  my  account  of  the  speech  signal  so  far,  I  have  tried  to  argue  that  it  is 
different  in  nature  both  from  the  messageless  signals  we  considered  first  and  from 
machine-generated  message-bearing  signals  like  the  teleprinter  transmission.  It  is  a 
signal  from  which  a  message  may  be  reconstructed  using  information  drawn  from 
many  sources,  both  information  at  various  levels  in  the  signal  itself  and  information 
stored  in  the  mind  of  the  listener.  The  amount  of  information  that  the  speaker  puts 
into  the  signal  depends  on  the  difficulty  that  he  imagines  the  listener  will  have  in 
reconstructing  the  message  from  it. 


The  speech  signal  and  speech  recognition 

I  would  like  to  turn  now  to  considering  the  speech  signal  more  specifically  in  relation 
to  automatic  speech  recognition. 

The  use  of  syntactic  constraints 

Because  we  are  directly  aware  of  speech  only  after  it  has  been  subjected  to  extremely 
sophisticated  processing  involving  information  from  several  sources,  it  is  all  too  easy 
to underestimate the difficulty of deducing a spoken message purely from the acoustic
signal. In particular, there is a danger of expecting to find products of a high-level
analysis of speech such as phonemes to be present as clearly identifiable entities in
the acoustic signal. Presumably, it was false impressions such as these that
influenced the author of a recent market study of speech technology [2] when he
predicted  that  commercially  viable  voice  activated  typewriters  would  be  available  in 
1987,  the  limiting  factor,  according  to  him,  being  the  cost  of  memory  to  store  a  large 
vocabulary. 

It is not clear to me that it would ever be possible for a machine to recognize unrestricted
speech with high reliability purely from the acoustic signal: it is, at best, like
asking a human listener to transcribe accurately a language he does not understand.
Accurate transcription of unrestricted text probably requires both a knowledge of the
syntax of the language and a comprehensive knowledge of the world. Present-day
practical systems limit the difficulty they face either by having a small vocabulary or
by having a larger vocabulary but with a syntax that limits the choice of words that
can follow a previous word or sequence of words. For it is, of course, the number of
choices that the system has to discriminate between that determines the difficulty of
a recognition task, not the total number of words it has to recognize. To give a trivial
example, the task of recognizing the two-word vocabulary Paris and London is made
easier if we go to a four-word vocabulary by including the words France and England
together with a syntax that requires that the city must be followed by its corresponding
country.
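
The effect of such a syntax can be sketched in a few lines of Python. The acoustic scores below are invented purely for illustration (higher means a better match) and the function names are hypothetical; the sketch only shows how scoring whole permitted word sequences lets a clearly heard second word rescue an ambiguous first one.

```python
# Permitted word sequences: each city must be followed by its country.
GRAMMAR = [("Paris", "France"), ("London", "England")]

def recognize_city_alone(city_scores):
    """Pick the best-scoring city with no help from a following word."""
    return max(city_scores, key=city_scores.get)

def recognize_with_syntax(city_scores, country_scores):
    """Pick the permitted city-country pair with the best combined score."""
    return max(GRAMMAR,
               key=lambda pair: city_scores[pair[0]] + country_scores[pair[1]])

# Hypothetical log-likelihood-style scores, not measured data.
city_scores = {"Paris": -4.1, "London": -3.9}       # city heard indistinctly
country_scores = {"France": -1.0, "England": -6.0}  # country heard clearly

print(recognize_city_alone(city_scores))
print(recognize_with_syntax(city_scores, country_scores))
```

With these made-up numbers the city on its own is narrowly misrecognized, while the pair is resolved correctly: the vocabulary has doubled, but there are still only two alternatives to discriminate between and twice as much acoustic evidence for doing so.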

Some of the most ambitious systems using syntactic and semantic constraints were
constructed as part of the ARPA Speech Understanding Project [3]. Systems were
devised that had "knowledge" of a small subset of the syntactic structures possible in
English and an "understanding" of a small universe (such as facts about ships). Since
that project ended in 1976, however, there seems to have been a considerable reduction
in interest in such systems. As I see it, apart from the high cost, there are several
good reasons for this loss of interest. First, there is the problem of the very considerable
effort needed to specify the syntax and semantics of the language to be used for a
particular  application.  This  prevents  speech  understanding  devices  from  being  sold  as 
off-the-shelf  devices.  Second,  there  is  a  problem  in  defining  what  is  known  technically 
as  a  habitable  subset  of  a  natural  language.  That  is  to  say,  as  the  syntactic  structures 
allowed  by  a  system  get  more  complex  and  the  language  one  can  use  gets  more 
natural,  it  gets  correspondingly  harder  to  teach  a  user  what  sentence  structures  are 
grammatical  to  the  recognizer  as  opposed  to  those  that  are  grammatical  in  the  user’s 
own language but not allowed in the recognizer's grammar. Finally, as a research tool,
complex systems seem to me to be unattractive because when overall performance
depends on so many factors it is difficult to draw useful conclusions from that performance
or from the relative performance of two such systems.

I  wonder  if  there  is  perhaps  a  parallel  to  be  drawn  between  speech  recognition  devices 
and  robots.  Before  any  useful  robots  had  been  built  the  image  of  the  robot  was  of  a 
device  that  superficially  resembled  a  man;  but  real,  useful  robots  working  for  example 
in car factories do not look at all like humans. Real, useful speech recognizers do not
use  a  syntax  that  superficially  resembles  natural  language,  though  they  do 
increasingly  use  a  task-oriented  syntax. 

An  example  of  a  very  simple  yet  effective  use  of  task-oriented  syntax  in  a  recognizer  is 
the addition of a check digit to a string of digits to be recognized. This digit would typically
be chosen such that when a string of digits including the check digit is summed
together  the  result  is  always  a  multiple  of  ten.  (For  instance,  the  string  1  1  1  would 
have  7  as  a  check  digit,  while  8  8  0  would  have  4.)  The  inclusion  of  the  check  digit  does 
not reduce the total number of possible digit strings that the system has to discriminate
between - if we have a three-digit string there are a thousand possibilities
whether we add a fourth check digit or not - but it does increase the amount of acoustic
information that can be used in the discrimination. Alternatively, we can view the
check digit as having made discrimination simpler by reducing the average number of
choices to be made per word. This average number of choices is known as the branching
factor, and it - or a generalization of it when the choices are not equiprobable - is
often used as a measure of difficulty of recognition tasks.
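
The check-digit scheme and its effect on the number of choices per word can be sketched as follows. The function names are hypothetical, and taking the "average" as a geometric mean over equiprobable strings is an assumption made here for illustration, not a definition given in the text.

```python
def check_digit(digits):
    """Return the digit that brings the sum of the whole string to a multiple of ten."""
    return (-sum(digits)) % 10

def is_valid(digits_with_check):
    """True if a recognized string, check digit included, sums to a multiple of ten."""
    return sum(digits_with_check) % 10 == 0

def mean_choices_per_word(n_strings, n_words):
    """Geometric-mean number of equiprobable choices per spoken word (assumed definition)."""
    return n_strings ** (1.0 / n_words)

print(check_digit([1, 1, 1]))           # 7, as in the example in the text
print(check_digit([8, 8, 0]))           # 4, as in the example in the text
print(is_valid([1, 1, 1, 7]))           # correctly recognized string
print(is_valid([1, 1, 9, 7]))           # one digit misrecognized
print(mean_choices_per_word(1000, 3))   # about 10 choices per word without a check digit
print(mean_choices_per_word(1000, 4))   # about 5.6 choices per word with one
```

Any single misrecognized digit changes the sum modulo ten, so a recognizer that only accepts valid strings can reject such errors or fall back on its next-best hypothesis.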

The use of devices like check digits is not as alien to natural language as it might
appear. A similar recognition-aiding device occurs, I believe, in all Indo-European
languages except the one I am using now. I am referring to the division of nouns into
two  or  three  classes  according  to  what  is  called  the  gender  of  the  noun,  the  gender 
classes  being  called  variously  masculine,  feminine  and  (sometimes)  neuter,  or  neuter 
and  common.  To  illustrate  how  it  can  help,  consider  the  French  nouns  poisson  (fish) 
and  boisson  (drink)  that  are  quite  similar  in  pronunciation,  but  differ  in  that  the  first 
is  masculine  and  the  second  feminine.  When  we  meet  them  in  sentences  like 
Le X est un poisson délicieux and Le X est une boisson délicieuse
(X is a delicious fish/drink) it is virtually impossible to confuse them despite their
phonetic similarity because the form of the adjective and the indefinite article both
depend on the gender of the noun they refer to and are therefore different for boisson
and poisson. In French there are only two noun classes against ten check digits, so
instead of calling gender a check digit we should perhaps better describe it as a
linguistic  parity  bit,  but  the  principle  is  the  same. 

Approaches to speech analysis

So  far,  I  have  tried  to  point  out  some  of  the  difficulties  in  analysing  the  speech  signal 
and the dangers of methods based on introspection. I would like to look now at some
approaches that have proved helpful. Useful approaches to the treatment of the
speech signal seem to fall under three headings, namely, production-based
approaches, perception-based approaches and pragmatic approaches. No automatic
recognition  system  relies  totally  on  just  one  of  these  approaches,  but  in  most  systems 
one  approach  dominates. 

Production-based approaches

It  does  not  seem  immediately  obvious  why  we  should  approach  the  recognition  of 
speech  from  the  viewpoint  of  how  it  was  produced  -  we  do  not,  after  all,  need  to  know 
how  the  teleprinter  signal  was  generated  in  order  to  decode  it.  Nevertheless,  there  is 
a  whole  spectrum  of  arguments  in  favor  of  taking  speech  production  into  account 
when  analysing  speech.  They  range  from  the  most  moderate,  with  which  no  one  would 
argue,  to  the  most  extreme,  which  few  people  now  hold. 

Before we consider these arguments, I shall have to break off from my main line of
argument for a little while in order to give you a brief overview of how speech is produced.
The human organs primarily involved in producing speech are the larynx,
which contains the vocal cords, and the pharynx and mouth cavity, which together
form the vocal tract, essentially a tube leading from the larynx to the
lips. A side branch, the nasal cavity, can be added to this tube by opening a valve at
the  back  of  the  mouth.  This  valve  is  open  in  nasal  consonants,  such  as  an  "m"  sound, 
and  in  nasalized  vowels,  which  form  a  separate  class  of  phonemes  in  some  languages 
such  as  French. 

Acoustic energy in speech is generated in one of two ways: by the action of the vocal
cords or by turbulence at a constriction created by the tongue or lips somewhere
along the vocal tract. As I mentioned earlier, sounds excited by the quasi-periodic
activity of the vocal cords are said to be voiced, and they generally play a more important
role in speech than noise-excited voiceless sounds. (All vowels and many consonants,
such as "l", "m" and "b" sounds, are voiced, while "sh", "k" and "f" are examples
of voiceless sounds.)

Whichever  kind  of  excitation  is  used,  the  basic  spectrum  of  the  excitation  is  modified 
by  the  resonant  structure  of  the  vocal  tract.  This  resonant  structure  depends  on  the 
position  that  the  tongue,  lips  and  jaw  are  in.  It  happens  that  the  generation  of  the 
excitation  and  its  spectral  modification  by  the  vocal  tract  are  largely  independent  of 
each  other  and  can  thus  be  considered  to  a  good  approximation  as  a  source  isolated 
from,  and  leading  into,  a  linear  filter. 
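
This source-and-filter view can be illustrated with a minimal sketch. The excitation is idealized as a train of unit impulses and the vocal tract as a single two-pole resonance; the sampling rate, the 100 Hz pitch, the 500 Hz resonance with 100 Hz bandwidth, and all function names are assumptions chosen for illustration.

```python
import math

FS = 8000  # sampling rate in Hz (assumed)

def impulse_train(f0, n_samples):
    """Idealized voiced excitation: one unit impulse per pitch period."""
    period = round(FS / f0)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(x, freq, bandwidth):
    """Two-pole linear filter standing in for a single vocal tract resonance."""
    r = math.exp(-math.pi * bandwidth / FS)            # pole radius from bandwidth
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / FS)
    a2 = -r * r
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(xn + a1 * y1 + a2 * y2)
    return y

# The source (pitch) and the filter (resonance) are set independently.
excitation = impulse_train(f0=100, n_samples=1600)
speech_like = resonator(excitation, freq=500, bandwidth=100)
```

Because the filter is linear and the excitation is periodic, the output settles into a pattern that repeats once per pitch period, while the shape of that pattern is set by the filter alone. Changing `f0` alters the repetition rate without touching the resonance, and vice versa.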

The upper trace of Figure 2 shows a 200 ms stretch of the waveform of a non-nasalized
vowel (strictly, it is the time-differenced waveform: differentiation provides a 6 dB per
octave lift, which serves to flatten the long-term spectrum for voiced speech). Notice
that the waveform consists of a pattern that repeats itself at regular intervals. The
repetition rate is the rate at which the vocal cords come together - typically a hundred
times a second for a man - while the repeating pattern itself is the response of
the  vocal  tract  to  that  periodic  excitation. 

The  lower  trace  in  Figure  2  shows  the  excitation  with  the  effect  of  the  vocal  tract 
removed. The impulse-like excitation occurs each time the vocal cords come together
and  close  off  the  airflow  from  the  lungs.  In  the  particularly  simple  vowel  shown  here  (it 
is  in  fact  the  "neutral"  vowel  occurring  in  a  word  like  "the")  essentially  what  happens 
to  the  impulse  is  that  it  travels  from  the  larynx  to  the  lips,  where  part  of  it  is  radiated 
into  the  free  air  beyond  the  lips  and  part  is  reflected  back  towards  the  larynx  with  its 
polarity  reversed.  At  the  larynx  the  signal  is  reflected  again,  and  it  continues  to 
bounce  between  larynx  and  lips  steadily  losing  energy  by  absorption  in  the  walls  of  the 
vocal  tract,  by  transmission  and  ultimate  absorption  behind  the  vocal  cords,  and  by 
radiation to the outside world until the next excitation impulse comes along. In other
speech  sounds  the  effect  of  the  vocal  tract  on  the  excitation  is  more  complex,  with 
reflections  occurring  at  more  places  than  just  the  larynx  and  lips.  Nevertheless,  the 
basic  structure  of  a  pattern  approximately  repeating  itself  at  approximately  regular 
intervals  is  retained. 

Figure 3 shows the power spectrum of a section of speech waveform like the one in
Figure 2. The regularly spaced spikes occur at integer multiples of the repeat frequency
of the excitation. This repeat frequency, corresponding to the first spike in the
figure, is known as the fundamental frequency, and the succeeding spikes are harmonics
of the fundamental. The intensity of the harmonics varies smoothly across the
spectrum in a way determined by the impulse response of the vocal tract. The peaks
in the spectrum coincide with resonances in the vocal tract. They are known as formants.
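
The harmonic structure can be checked numerically with a small sketch. It covers only the spacing of the spikes, using an idealized impulse-train excitation with no vocal tract filtering; the sampling rate, the 100 Hz fundamental and the analysis length are assumed values, and the direct DFT is written out longhand simply to keep the example self-contained.

```python
import cmath

FS = 8000   # sampling rate in Hz (assumed)
F0 = 100    # fundamental frequency in Hz (assumed)
N = 800     # analysis length: exactly ten pitch periods

period = FS // F0
x = [1.0 if n % period == 0 else 0.0 for n in range(N)]

def dft_magnitude(signal, k):
    """Magnitude of the k-th discrete Fourier transform bin."""
    n_total = len(signal)
    return abs(sum(s * cmath.exp(-2j * cmath.pi * k * n / n_total)
                   for n, s in enumerate(signal)))

bin_hz = FS / N  # spacing between DFT bins (10 Hz here)
spikes = [k * bin_hz for k in range(N // 2) if dft_magnitude(x, k) > 1e-6]
print(spikes[:5])
```

Energy appears only at 0 Hz (the mean level of the impulse train) and at integer multiples of the 100 Hz fundamental. Passing the excitation through a vocal tract filter would leave these positions unchanged and only reshape the heights of the spikes.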

The ability to describe the speech signal in terms of an impulse response and the frequency
of the impulses is extremely important for speech recognition. The impulse
response varies as the positions of the tongue, jaw and lips are changed, while the fundamental
frequency depends on the muscles that control the tension in the vocal
cords and on the air pressure behind the vocal cords. For the most part, changes in
the settings of the larynx and vocal tract occur slowly relative to the frequencies
involved in speech. Thus, while we need a sampling rate of at least eight thousand
times a second in order to obtain a reasonable digital description of the speech
waveform, a description in terms of fundamental frequency and a few parameters
describing the impulse response typically needs to be updated as little as a hundred
or even fifty times a second.

A  second  major  advantage  of  an  impulse  response/fundamental  frequency  description 
is  that  the  two  factors  perform  separate  linguistic  functions.  In  most  western 
languages the identity of a word does not depend on the fundamental frequency pattern
with which it is uttered. In some other languages, such as Chinese and to a much
lesser extent Norwegian, the identity of a word may depend on the fundamental frequency
pattern, but even then a practical recognition strategy must still separate the
two factors: the fundamental frequency pattern and the configuration of the articulators
in the vocal tract are substantially independent attributes of the word.

For  non-nasalized  vowels  and  some  non-nasal  consonants  the  impulse  response  of  the 
vocal  tract  is  quite  accurately  modeled  by  a  set  of  resonances  in  series,  the  important 
resonances  lying  in  the  range  300Hz  to  3kHz.  For  such  sounds,  provided  the  analysis 
is  carried  out  in  proper  synchrony  with  the  excitation,  a  technique  known  as  linear 
prediction  can  be  used  to  determine  from  the  waveform  the  frequencies  and 
bandwidths  of  the  resonances  (a  comprehensive  account  of  linear  prediction  is  given 
in the book by Markel and Gray [4]). What is more, the analysis can go on to
reconstruct the cross-sectional profile of an acoustic tube that would have such a set of
resonances,  and  the  profiles  can  often  show  a  close  similarity  to  the  vocal  tract 
configuration  that  produced  the  sound.  This  kind  of  analysis  of  speech  can  therefore 
be said to be strongly production-oriented. I should point out that the analysis is
approximate inasmuch as the model of the excitation by a sequence of impulses is
inexact,  that  the  analysis  is  inevitably  less  successful  for  certain  sounds  such  as 
nasals  where  the  simple  tube  vocal  tract  model  does  not  fit,  and  that  in  most  practical 
cases  its  accuracy  is  further  reduced  by  applying  the  analysis  at  regular  intervals 
along  the  waveform  rather  than  in  synchrony  with  the  excitation. 

We  can  now  get  back  to  considering  the  arguments  in  favor  of  approaching  speech 
recognition  from  the  point  of  view  of  speech  production.  We  can  see  that  at  the 
moderate  end,  proponents  of  production-based  analysis  could  point  out  that  it  can 
lead  to  the  generation  of  a  simple,  compact  description  of  the  speech  signal,  and, 
moreover,  one  in  which  features  that  determine  the  lexical  identity  of  a  word  are  quite 
well separated from features that have more to do with the function of the word in the
sentence or with the mood of the speaker. Somewhat more speculatively, if we could carry
out  a  production-based  analysis  well  enough  we  might  hope  to  predict  coarticulation 
phenomena  such  as  our  breab  board  example  as  well  as  other  energy-saving  shortcuts 
that  the  articulators  might  take,  such  as  the  possible  failure  of  the  tongue  to  reach  its 
target  position  for  a  vowel  separating  two  consonants  in  a  rapidly  spoken  syllable. 
Finally,  the  most  extreme  view,  embodied  in  the  Motor  Theory  of  Speech  Perception 
[5],  maintains  that  human  speech  perception  works  by  mentally  reconstructing  the 
articulator  settings  that  produced  the  speech  signal  being  heard.  According  to  this 
last  view,  which  is,  I  believe,  much  less  popular  than  it  was  fifteen  years  ago,  it  would 
be  highly  desirable  that  an  automatic  recognizer  should  also  work  by  reconstructing 
the  production  details  of  the  speech  signal  it  is  receiving.  Linear  prediction  looks  to  be 
the  best  current  possibility  for  carrying  out  such  a  reconstruction. 

Perception-based approaches

To some people it may seem a truism to assert that we should base speech
recognizers on human speech perception. Others sometimes point out that aircraft don't flap
their  wings  just  because  birds  do,  so  machine  recognition  of  speech  need  not  copy  the 
human  model.  But  this  is  a  false  analogy:  first  there  was  air,  and  then  birds  and  men 
found ways of flying in it; I doubt if anyone would claim that first there was speech and
then men evolved ears to listen to it! Speech is certainly adapted to our human
capacity to perceive it, and it is desirable that an automatic recognizer should be able to
make  the  distinctions  that  we  can  make  and  ignore  those  we  cannot  make. 

The  problem  in  basing  a  recognition  approach  on  human  speech  perception  lies  not  in 
deciding  whether  it  is  a  good  idea  to  do  so,  but  rather  in  the  fact  that  we  know 
remarkably  few  hard  facts  about  human  speech  perception.  What  we  know  somewhat 
more  about  is  human  sound  perception  in  general.  We  know,  for  example,  that  we  are 
relatively  insensitive  to  the  phase  information  in  a  signal,  and  it  would  consequently 
make  little  sense  to  build  a  recognizer  that  tried  to  recognize  a  specific  waveform, 
since a slight change in the relative phases of its spectral components would be
undetectable to human ears and yet it could cause the waveform to look quite different.
For  this  reason,  all  practical  recognizers  work  with  some  representation  or  other  of 
the  short  term  power  spectrum  and  ignore  the  phase  spectrum. 

Another known property of human sound perception is frequency masking, the
tendency of an intense tone to obscure the presence of a less intense tone at a
neighboring frequency. It follows from this property that our hearing is more sensitive to the
peaks in the spectrum than to the troughs, whose details tend to be masked by nearby
peaks.  Thus  a  speech  recognizer  that  used  a  representation  of  the  spectrum  that  was 
particularly  sensitive  to  dips  in  the  spectrum  would  be  unlikely  to  work  well.  If  the 
algorithm  commonly  used  in  linear  prediction  is  viewed  as  a  means  of  characterizing 
the  short-term  power  spectrum,  it  turns  out  that  it  has  the  desirable  property  of 
doing  a  better  job  of  characterizing  the  peaks  than  the  troughs. 

An  alternative  -  and  in  fact  much  older  -  method  of  reducing  sensitivity  to  weak  tones 
when  they  are  close  to  a  strong  tone  is  to  divide  the  auditory  spectrum  into  a  set  of 
bands,  the  acoustic  power  in  the  set  of  frequencies  in  each  band  being  averaged 
together.  Masking  experiments  with  human  subjects  tell  us  how  wide  these  critical 
bands  should  be:  above  about  1kHz  the  bands  should  not  be  of  constant  width,  but 
rather  they  should  increase  in  width  in  rough  proportion  to  their  centre  frequency 
with  about  three  bands  in  each  octave.  A  channel  vocoder  uses  a  bank  of  filters  that  is 
rather  like  the  set  of  critical  bands,  and  the  same  principle  is  used  in  many  successful 
speech  recognizers.  There  is  an  interesting  conflict  between  production-based  and 
perception-based approaches here. A representation of speech based on linear
prediction, which generally models speech production quite well, gives equal resolution to all
parts of the spectrum. Thus, a comparison between two recognition systems which
were  similar  except  that  one  carried  out  a  filter-bank  analysis  of  the  speech  and  the 
other  a  linear-predictive  analysis  would  amount  to  a  comparison  between  a 
production  modeling  approach  and  an  auditory-perception  modeling  approach.  Davis 
and  Mermelstein  [6]  carried  out  such  a  comparison  and  found  a  clear  advantage  for 
the  filter  bank.  Moreover,  among  the  parameter  sets  that  can  be  used  to  present  the 
results  of  the  linear  prediction  the  ones  that  are  best  interpreted  as  providing  a 
description  of  the  general  shape  of  the  spectrum  (i.e.  the  linear  prediction  cepstrum) 
performed  better  than  the  more  production-oriented  area  coefficients  that  would  be 
used  in  reconstructing  vocal-tract  area  functions. 
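
As a rough illustration of the band layout described above, the following sketch builds a
set of band edges and averages power-spectrum points within each band. It is only a sketch
under stated assumptions: the 100 Hz band width below 1 kHz, the 100 Hz to 4 kHz range and
the function names `band_edges` and `band_average` are illustrative choices that do not come
from the text; only the figure of about three bands per octave above 1 kHz is taken from it.

```python
def band_edges(f_low=100.0, f_break=1000.0, f_high=4000.0,
               linear_width=100.0, bands_per_octave=3):
    """Return band edge frequencies in Hz.

    Below f_break the bands have a constant width (linear_width is an
    assumed value). Above f_break each edge is the previous edge times
    2**(1/bands_per_octave), so width grows in proportion to frequency.
    """
    edges = [f_low]
    while edges[-1] + linear_width <= f_break + 1e-9:
        edges.append(edges[-1] + linear_width)
    ratio = 2.0 ** (1.0 / bands_per_octave)
    while edges[-1] * ratio <= f_high * (1 + 1e-9):
        edges.append(edges[-1] * ratio)
    return edges


def band_average(power, bin_hz, edges):
    """Average the power-spectrum points that fall inside each band."""
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        bins = [p for i, p in enumerate(power) if lo <= i * bin_hz < hi]
        out.append(sum(bins) / len(bins) if bins else 0.0)
    return out


edges = band_edges()
flat = band_average([1.0] * 100, 40.0, edges)  # flat spectrum, 40 Hz spacing
print(len(edges) - 1, "bands")
```

Averaging within a band deliberately discards the fine detail near a strong peak, which is
the effect the masking argument asks for.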

Some  researchers  [7,8]  have  gone  further  in  modeling  neural  behavior  in  the  inner 
ear,  in  some  cases  incorporating  the  superior  time  resolution  available  to  the  ear  at 
high  frequencies  where  frequency  resolution  is  poor.  Although  improved  recognition  of 
stops  and  fricatives  has  been  claimed  to  result,  the  procedure  has  not  been  widely 
adopted  because  it  is  computationally  expensive. 

If  we  now  turn  our  attention  to  speech  perception  rather  than  auditory  perception,  I 
have  to  admit  that  I  find  the  field  confusing,  and  I  do  not  feel  competent  to  make  any 
attempt  at  an  overview  of  present  knowledge.  There  is,  perhaps,  evidence  [9]  that 
speech  processing  works  in  a  "left-to-right"  fashion  (i.e.  forwards  in  time)  rather  than, 
say, first picking out stressed syllables and working outwards from them in both
directions, and that possibilities for each word to be recognized are considered in parallel
rather  than  exploring  the  most  promising  interpretation  first  and  returning  when  it 
meets  trouble.  I  am  sure,  however,  that  both  these  statements  would  be  disputed  by 
some  specialists  in  speech  perception. 

In  1979  Klatt  published  a  long  paper  [10]  proposing  the  incorporation  of  models  of 
human speech perception in a recognition system. He subsequently reported
experiments [11] suggesting that listeners use a different criterion when making a judgment
of the phonetic similarity of two sounds from the one they use when making a general
psychophysical comparison of two sounds. The psychophysical judgments seem
consistent with the general spectral shape comparisons carried out in most recognition
systems, while the phonetic judgments seem more dependent on the frequencies of
energy peaks in the spectrum. He has more recently reported work on a metric
intended to correlate better with human phonetic judgments [12]. It will be
interesting to see how the metric performs - many researchers in the past have thought it
desirable  to  represent  speech  in  terms  of  the  frequencies  of  energy  peaks  -  formant 
frequencies  -  but  have  been  held  back  by  the  problem  that  occasional  errors  in  peak 
frequency  assignment  can  have  disastrous  results  on  performance.  The  new  metric 
avoids  making  hard  decisions  about  formant  frequencies. 

In  general,  it  is  striking  how  little  the  results  of  research  in  speech  perception  have 
influenced  the  design  of  successful  speech  recognition  systems,  though  that  does  not, 
of  course,  preclude  such  influence  in  the  future. 

Pragmatic  approaches 

The  heading  "pragmatic  approaches"  seems  at  first  sight  like  a  catch-all  under  which 
any  recognition  work  not  based  on  production  or  perception  results  can  be  placed.  To 
some extent it is just that, except that it excludes approaches justified by
introspection or by pet theories inadequately supported by experimental evidence. I mean the
term  to  be  confined  to  approaches  that  are  justified  primarily  by  the  fact  that  they 
are  found  to  work.  A  notable  example  of  such  an  approach  is  provided  by  the  various 
versions  of  the  dynamic  programming  algorithm  for  time  aligning  two  productions  of 
a word or sequence of words. The algorithm will no doubt be explained in detail in
later sessions; for the moment all I want to point out is that it is central to a large
proportion of successful recognition systems and that its introduction was not inspired by
production  or  perception  considerations  but  rather  by  the  fact  that  it  could  cope  with 
a  phenomenon  found  to  occur  in  the  signal,  namely,  non-linear  timing  variations 
amongst  different  productions  of  the  same  word. 
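
To make the idea of time alignment by dynamic programming concrete, here is a minimal
sketch. It is not any particular published version of the algorithm: the step pattern (one
frame forward in either or both sequences), the absolute-difference frame distance and the
name `dtw_distance` are assumptions made for illustration, and real systems compare
spectral feature vectors rather than single numbers.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the best non-linear time alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # D[i][j] holds the lowest accumulated cost of aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # a advances, b repeats
                                 D[i][j - 1],      # b advances, a repeats
                                 D[i - 1][j - 1])  # both advance
    return D[n][m]


template = [0, 1, 2, 3, 2, 1, 0]
slow = [0, 0, 1, 1, 2, 3, 3, 2, 1, 0, 0]   # same shape, uneven timing
other = [3, 2, 1, 0, 1, 2, 3]              # a different shape
print(dtw_distance(template, slow), dtw_distance(template, other))
```

The unevenly timed version aligns with the template at no cost, while the different shape
cannot be made to match however the time axis is stretched.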

Perhaps  the  most  extreme  example  of  an  approach  based  directly  on  the  properties  of 
the signal itself is the work of Jelinek's group at IBM [13]. Instead of having syntactic
rules supplied by the system designer, the system itself deduces transition
probabilities between words from a very large amount of training data. It then uses those
probability estimates in attempting to decode new material. Results have been reported
on a database of natural English consisting of a set of patent texts concerning lasers.
It constitutes the most ambitious current attempt at single-speaker speech
recognition that I know of.
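
The notion of deducing word transition probabilities from training data can be sketched
very simply. The example below is not a description of the IBM system: it merely counts
adjacent word pairs and normalizes the counts, and the function name, the toy training
sentences and the absence of any smoothing for unseen pairs are all assumptions made for
illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(sentences):
    """Estimate P(next word | previous word) from adjacent-pair counts."""
    pair_counts = defaultdict(Counter)
    for words in sentences:
        for prev, nxt in zip(words[:-1], words[1:]):
            pair_counts[prev][nxt] += 1
    probs = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        probs[prev] = {w: c / total for w, c in counter.items()}
    return probs


training = [["the", "laser", "beam"],
            ["the", "laser", "cavity"],
            ["the", "beam"]]
probs = transition_probabilities(training)
print(probs["the"])
```

With so little data most word pairs are never seen at all, which is one reason why a very
large amount of training material is needed.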

One property that many of the more successful - and above all, practically useful -
systems share is simplicity. I suspect it is no accident that Harpy, the only one of the
ARPA Speech Understanding systems to meet the original success criteria, was
distinguished from its competitors mainly by the fact that it was considerably simpler.
John  Bridle’s  successful  continuous  word  matching  algorithm  [14],  which  I  hope  he  will 
describe  to  you,  is  also  considerably  simpler  than  other  algorithms  that  have  been 
proposed  for  recognizing  word  sequences. 

Why  should  simple  approaches  be  better?  I  think  the  main  reason  is  that  they  have 
fewer  system  parameters  to  tune,  and  they  can  consequently  reach  a  better  state  of 
optimization with a given amount of training data than could a more complicated
system. A second, related, reason is that in developing a simple system a developer can
more  easily  assess  the  effect  on  performance  of  a  design  decision  he  has  made  than  he 
could  in  a  complicated  system  in  which  many  rules  and  processes  interact. 

I  am  convinced  that  for  the  foreseeable  future  practically  useful  recognition  systems 
will  remain  simple  systems.  To  the  extent  that  their  design  reflects  properties  of 
human  speech  production  or  perception,  I  believe  that  the  better  ones  will  be  based 
on  solidly  established  properties  and  not  on  speculation. 


Further reading

If anyone is interested in a closer look at phonetics, there are introductory texts by
Ladefoged [15] and O'Connor [16]. The standard work on the acoustic theory of
speech production was written by Fant [17].


SUMMARY


Digital  techniques  have  opened  quite  new  possibilities  for  processing  of  speech  signals. 
This  is  true  for  analysis  and  for  transmission.  These  new  methods  are  characterized  by  a 
strict adaptation to the very special peculiarities of speech.

The lecture will give an overview of the mathematical possibilities and their
relevance to the different parts of the speech signal. Efforts to represent speech in a
digital and more or less redundancy-free form can give good insight into all the
characteristics of such a highly complex signal.

Possibilities for representation of speech signals range from the very simple
pulse-code-modulation (PCM) techniques to sophisticated vocoders.

The  research  work  done  for  speech  transmission  and  coding  has  prepared  the  way  for 
methods  to  recognize  and  synthesize  speech  signals.  Automatic  speech  synthesis  will  be 
an  important  tool  for  the  communication  between  man  and  machine.  The  lecture  will  give 
an  additional  introduction  into  the  techniques  of  automatic  speech  synthesis. 

1. INTRODUCTION


Speech signals are the most important signals in today's and tomorrow's
telecommunication systems. This results from the fact that human communication is the basis of all
communication systems. This man-to-man communication will in the future be combined with
efficient man-machine communication. One part of this communication, speech output
by computers, will be reported on later in this paper.

Techniques of digital signal processing have opened up quite new and exciting ideas on how to
handle the structure of speech signals. We can now describe quite well the
information-theoretic content of the signal. Most of the characteristics which are necessary to
develop a suitable model for the speech signal can be understood from the principles of
natural speech production /1/. Fig. 1 shows the essential parts of this process. In the
case of voiced sounds a pulse excitation signal is produced by the vocal cords, which
vibrate when the air stream from the trachea passes through. These pulses are
modulated within the cavities of the throat, the mouth and the nose, and the resulting
signal will be a periodic voiced signal. Its characteristics, that is, the sort of
sound, are defined by the acoustic properties of these filtering cavities. Their
properties are defined by the geometric dimensions of the cavities, which can be changed
during the process of articulation.


Fig. 2: Time-domain and frequency-domain representation of speech.
a) Time-domain signal: excitation pulse and speech signal.
b) Frequency-domain signal: excitation pulses, transfer function of the vocal tract
(articulation), and spectrum of the speech signal.

Fig. 2 gives an overview of the important signals in the process of speech production
and speech perception. Fig. 2a shows the time-domain signal already described. The
excitation pulses repeat with the pitch period Tp; the corresponding pitch frequency
1/Tp is about 120 Hz on average for male voices. We find this period again in the
speech signal, which is also characterized by higher-frequency oscillations, the
so-called formant frequencies. This may be understood much better if we look at the
frequency characteristics of such a speech signal (Fig. 2b).

The excitation pulses of voiced signals have a line spectrum with an envelope that falls
off towards higher frequencies at about 6 to 10 dB/octave. The spectral transfer function of
the vocal tract has very strong resonances, the formants already mentioned, which
characterize the sound of the speech signal. During the process of articulation the
spectral envelope of the excitation signal is shaped by the transfer characteristics
of the vocal tract with its sharp resonance peaks. For unvoiced sounds the basic
process is quite similar. The only difference is that the excitation
signal now consists of a sort of turbulence noise, which is created by air streaming
through some quite narrow constrictions within the vocal tract.

The process of speech production is of course a dynamic process. This means that all the
parameters mentioned change relatively quickly. The normal speaking rate
is about 10 to 20 sounds per second. The duration of different sounds varies between 5
ms for the very short plosive sounds like /t/ and about 100 ms for slowly spoken
voiced sounds like some vowels. These time characteristics of speech signals are
important for many speech analysis and synthesis techniques. In the case of speech
transmission it is additionally important to know something about the human perception of speech
signals.

Man's acoustic perception system is not exclusively dedicated to the perception of
speech. However, its perception principles are very well adapted to the special qualities
of speech signals. In principle the ear makes a spectral analysis with additional
emphasis on the analysis of time-varying signals. This means that there is a combination of a
sort of very narrow band spectral analysis for precise detection of the formants' centre
frequencies and simultaneously a precise analysis of time variations in the spectral
characteristics, including periodicity detection for the analysis of the varying line
structure of voiced speech signals. So our speech perception apparatus is a highly
sophisticated system with special adaptation to the structure of speech signals. We must
take account of these and many more facts if we want to design good speech transmission
systems and want to produce natural and intelligible speech. On the other hand, speech
signals have been optimally adapted to a sort of spectral analysis with quite special
properties, which is carried out within our human ear and the following neural stages in our
brain. So technical systems for speech analysis not only must take account of the
physiological processes but also can learn many things from these processes. These ideas
are especially important for preprocessing and feature extraction stages in a speech
processing system.

Fig. 3 gives us a rough overview of this interaction of speech transmission and
recognition/synthesis ideas. Every technical analysis starts with a sort of preprocessing
by which some important signal characteristics are extracted. The following feature
extraction is still a preprocessing stage for a speech recognition system, but here
more complicated and combinatorial parameters are extracted, e.g. segmented phoneme
parameters or prosodic parameters like speech intonation. The following stages are concerned
with the central task of recognition and understanding. Then a speech output is created
based on linguistic rules. The phonetic and speech synthesis parts again handle higher
and lower level parameters to produce a speech signal which is fed to a loudspeaker
to make an acoustic signal. In speech transmission with redundancy reduction we skip
the inner kernel of the system in Fig. 3 and directly transmit a parametric
description of the analyzed signal to a sort of synthesizer which can reproduce the
physical signal.

Fig. 3: Relations between speech transmission and speech recognition and synthesis.

So it will be important to understand the significance of different sorts of
preprocessing in speech processing by studying some of the more important speech transmission
systems in use today, and to understand how their signals can be useful for automatic speech
recognition and synthesis.


2. MATHEMATICAL AND THEORETICAL PRINCIPLES OF DIGITAL SPEECH PROCESSING /4/,/5/

The term "speech processing" does not automatically include the term "digital", but in
practice today analog speech processing is used only in very special cases. So the
basis of all our operations will be a sampled and quantized signal. This means that the
speech signal has to be coded into the form of numbers. The principle of this pulse code
modulation (PCM) process is shown in Fig. 4 /2/. The analog waveform (Fig. 4a) is
sampled at a fixed rate 1/Ts [Hz], normally at double the highest frequency present in
the signal. Telephone quality speech has a bandwidth of about 4 kHz, so it is necessary
to sample such a signal at a rate of at least 8 kHz. This means that Ts is 125 µs. Speech
signals of better quality need a higher sampling rate of up to 20 kHz, resulting in a speech
bandwidth of 10 kHz. Sampling produces an amplitude modulated impulse signal (Fig. 4b).
The amplitude of each of these impulses is now measured with a fixed precision, and
these measured values in binary form are the final result of the whole pulse code
modulation process. We get a series of numbers which still represents the full speech
waveform - with some minor errors - and which can again be used to reproduce the analog
waveform for presentation of the speech signal through a loudspeaker.

Fig. 4: Pulse-code representation of analog signals.
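
The sampling and quantizing steps of pulse code modulation can be sketched as follows.
This is only an illustration under stated assumptions: the text specifies the 8 kHz rate
but says no more about precision than that it is fixed, so the 8-bit uniform quantizer,
the full-scale value and the names `pcm_encode` and `pcm_decode` are hypothetical choices.

```python
import math


def pcm_encode(signal, duration_s, fs_hz=8000, bits=8, full_scale=1.0):
    """Sample signal(t) at fs_hz and quantize uniformly to signed integers."""
    levels = 2 ** (bits - 1)   # 128 for the assumed 8-bit code
    ts = 1.0 / fs_hz           # sampling period Ts, 125 microseconds at 8 kHz
    codes = []
    for n in range(int(round(duration_s * fs_hz))):
        x = signal(n * ts)
        q = int(round(x / full_scale * (levels - 1)))
        codes.append(max(-levels, min(levels - 1, q)))  # clip to code range
    return codes


def pcm_decode(codes, bits=8, full_scale=1.0):
    """Map the integer codes back to amplitudes."""
    levels = 2 ** (bits - 1)
    return [c * full_scale / (levels - 1) for c in codes]


def tone(t):
    """Test signal: a 440 Hz sine wave at half of full scale."""
    return 0.5 * math.sin(2 * math.pi * 440 * t)


codes = pcm_encode(tone, 0.02)   # 20 ms of signal
decoded = pcm_decode(codes)
worst = max(abs(d - tone(n * (1.0 / 8000))) for n, d in enumerate(decoded))
print(len(codes), "samples, worst-case error", worst)
```

The "minor errors" mentioned above appear here as the quantization error, which for a
uniform quantizer cannot exceed half a quantization step as long as the signal is not
clipped.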

Pulse code modulation loses essentially nothing of the signal, but we can now process
speech samples by number-crunching techniques in fast digital signal processing systems.
There are many well-known operations for gaining more intimate knowledge about the
information within the speech signal. Some of the more important basic operations are
transforms, correlation, and prediction. Pitch analysis will be described as an example
of a very special speech-oriented operation.


2.1 Transform Operations

The fundamental principle of signal transformation has been well known for almost 200
years through the Fourier transform. This transform represents the signal by a description
in terms of harmonic waves, the sine and cosine waves. We call the result a Fourier
spectrum. Fig. 2 gives an example: by Fourier transform, the time domain signal in
Fig. 2a can be represented by its power spectrum, in which the phase information is lost
(for speech intelligibility phase information is not very important).

Speech signals are normally not stationary; their waveforms change within
very short sections, lasting on average about 20 to 30 ms. If we choose a sampling
frequency of 8 kHz, such a segment or block of 20 ms contains 160 samples of the
original speech waveform. This array of 160 samples will be called a vector, and all discrete
transform operations can be interpreted as operations within an n-dimensional vector
space, in this case a 160-dimensional vector space. We call such operations, which
concentrate only on a well defined short segment of a signal waveform, short-time
operations. This means that we suppose the speech signal does not change its parameters within
this short segment (which is not really true, but the error is small enough).
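
The segmentation into short-time blocks can be sketched in a few lines. The function name
`frames` and the choice of non-overlapping blocks are assumptions made for illustration;
the 8 kHz rate and the 20 ms block length are the values quoted above.

```python
def frames(samples, fs_hz=8000, frame_ms=20):
    """Cut a sample sequence into consecutive, non-overlapping blocks."""
    n = fs_hz * frame_ms // 1000   # 160 samples for 20 ms at 8 kHz
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]


blocks = frames(list(range(800)))  # 100 ms of dummy samples
print(len(blocks), "blocks of", len(blocks[0]), "samples")
```

Each block is then treated as one 160-dimensional vector by the short-time operations
that follow.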

The principle of a signal transformation can be easily understood from the vector
operation shown in Fig. 5. Here only a two-dimensional signal space is shown. The signal
vector x is described by its two components (x1, x2), but the mathematical principles are
equally valid for higher-dimensional signal spaces. The task of the transformation is
to transform the basic set of values into a new transformed space which is better
adapted to the characteristics of the signal.

The original n-dimensional signal vector

$$\mathbf{x} = (x_1, x_2, \ldots, x_n)^T \qquad (1)$$

should be transformed into a new vector y by a linear operation

$$\mathbf{y} = \mathbf{A}^T(\mathbf{x} - \boldsymbol{\mu}) \qquad (2)$$

Here A is a transformation matrix whose column vectors are the basis functions of the
new coordinate system. The vector μ adds a shifting operation by which the centering of
the new coordinate system can be further improved with respect to the signal vectors x.
Now the new coordinates (y1, y2) are much better suited to describe the original vectors
with smaller numbers. The value range of a quantizer for such a transformed signal can
therefore be much smaller than that of the original quantizer.
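
A small two-dimensional example may help to show why the transformed coordinates need a
smaller value range. The sketch below applies the linear operation of Eq. 2, read here as
y = A^T (x - μ), to a few signal vectors whose two components are nearly equal. The
rotation matrix, the shift vector and the sample vectors are all hypothetical values
chosen for illustration, as is the function name `transform`.

```python
import math


def transform(x, A, mu):
    """Return y = A^T (x - mu); the columns of A are the new basis vectors."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    n = len(d)
    return [sum(A[i][j] * d[i] for i in range(n)) for j in range(n)]


c = 1 / math.sqrt(2)
A = [[c, -c],
     [c, c]]           # columns (c, c) and (-c, c): axes rotated by 45 degrees
mu = [5.0, 5.0]        # assumed centre of the signal vectors

signals = [[4.0, 4.2], [5.0, 4.9], [6.1, 6.0], [5.5, 5.4]]
ys = [transform(x, A, mu) for x in signals]
for x, y in zip(signals, ys):
    print(x, "->", [round(v, 3) for v in y])
```

Because neighbouring components are strongly correlated, nearly all of the variation ends
up in the first new coordinate, and the second one stays close to zero.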

One of the most important transforms is the Fourier transform. Here the new basis
vectors are the sine and cosine functions. The original speech vector after Fourier
transformation is expressed in terms of sine and cosine waves. The result is normally called
a spectrum or a frequency domain representation of the speech signal. This sort of
representation is very advantageous because every linear system like the vocal tract
produces harmonic waves, and there is clear evidence that the human ear makes a
frequency analysis.

Fig. 6 shows such a digitally computed speech spectrum of the German word "sieben". The
frequency axis ranges to about 4 kHz and the duration of the digit was about 800 ms. A
new short-time spectrum is computed every 10 ms. Here we can see that each
spectrum differs only slightly from the preceding one. Only when the explosion of the
sound /b/ happens do we notice a very fast onset of this sound, after a pause in which the
explosion has been prepared. The spectral energy is marked by the darkness of the
discrete points, and we can see that, e.g. in the case of the sound /i/, there are about three
frequency areas with high energy, at about 500 Hz, 2600 Hz and 3100 Hz. The pattern of
these formants is relatively constant during the sound. On the other hand, the formant
change from the sound /ə/ to /n/ is quite well marked. Every short-time spectrum has
about 100 points, and so the frequency distance between neighbouring points is about 40
Hz. This is a frequency distance which normally is comparable to the human ear's
frequency selectivity.

Fig. 5: Principle of transformation in vector domain.

In practice the matrix operation of Eq. 2 is carried out by a very efficient procedure
called the Fast Fourier Transform (FFT). The number of multiplications necessary is about
n·ld n, where n is the number of points and ld denotes the logarithm to base 2. Here we
need about 660 multiplications every 10 ms, so one multiplication may take at most 15 µs.
This is quite a long time for modern signal processors, which can therefore perform this
transform in real time. For many applications in speech processing 100 points are too many,
and so groups of points are combined to give a coarser spectral analysis, a digital
variant of the long-known band-filter analysis.
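
The multiplication budget quoted above can be checked with a few lines of arithmetic. The
function name `fft_budget` is hypothetical, and the count n·log2(n) is the rough estimate
used in the text rather than an exact operation count for any particular FFT program (a
simple radix-2 FFT would in any case need n to be a power of two).

```python
import math


def fft_budget(n_points=100, frame_interval_s=0.010):
    """Return (FFT multiplications, direct matrix multiplications,
    time available per FFT multiplication in seconds)."""
    fft_mults = n_points * math.log2(n_points)
    direct_mults = n_points * n_points   # full matrix-vector product
    return fft_mults, direct_mults, frame_interval_s / fft_mults


fft_mults, direct_mults, seconds_each = fft_budget()
print(round(fft_mults), direct_mults, round(seconds_each * 1e6, 1))
```

The comparison with the direct matrix product shows where the saving comes from: the full
product needs n squared multiplications for every spectrum.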


Fig. 6: Short-time spectrum.


2.2 Autocorrelation


We have seen that speech signals are linear superpositions of harmonic waves. This also
means that consecutive samples are highly correlated; a speech waveform does not jump
randomly. Beyond this, there are periodicities in the signal which result from the
periodic excitation of voiced sounds. Such periodicities can be easily detected by
autocorrelation. Fig. 7 shows the principle. The speech samples x(m) (here we prefer not
to use the vector notation) are delayed by a varying number of samples k, and the delayed
and non-delayed signals are multiplied to form the autocorrelation function

$$\varphi(k) = \frac{1}{n} \sum_{m} x(m)\, x(m+k) \qquad (3)$$

where n is the number of samples which the speech segment contains.

Fig. 7: Principle of autocorrelation analysis.

Of course we can do this operation only within a short-time segment because the speech
characteristics change. The correlation function φ(k) shown in Fig. 7a is near 1 at very
small values of k. This means that neighbouring samples are quite similar. The next peak
marks the periodicity of this voiced speech signal. The lag k of this peak, which is the
pitch period, can easily be found by peak picking.
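
Peak picking on the autocorrelation function can be sketched as follows. This is a
minimal illustration, not the author's implementation: the synthetic test segment, the
assumed 8 kHz sampling rate and the search range of 20 to 150 samples are all choices
made here for the example.

```python
import math

def autocorrelation(x, k):
    """phi(k): averaged product of the segment with its copy delayed by k samples."""
    n = len(x)
    return sum(x[m] * x[m + k] for m in range(n - k)) / n

def pitch_period_acf(x, k_min, k_max):
    """Return the lag in [k_min, k_max] with the largest autocorrelation value."""
    return max(range(k_min, k_max + 1), key=lambda k: autocorrelation(x, k))

# Synthetic "voiced" segment: a fundamental plus one harmonic with a period of
# 80 samples (100 Hz at an assumed 8 kHz sampling rate).
PERIOD = 80
segment = [
    math.sin(2 * math.pi * m / PERIOD) + 0.5 * math.sin(4 * math.pi * m / PERIOD)
    for m in range(400)
]

# Very small lags are excluded because phi(k) is always large there.
estimated_period = pitch_period_acf(segment, 20, 150)
```

Restricting the search to a plausible range of lags keeps the large values near k = 0 from
being mistaken for the pitch peak.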

The autocorrelation function is closely related to the power spectrum of a signal: the
Fourier transform of the power spectrum is the autocorrelation function. The
autocorrelation function is still a time-domain function and therefore gives information
about the time-domain characteristics of the speech signal.

2.3 Linear Prediction /3/

Linear prediction is based on the autocorrelation characteristics of a signal. High
correlation values φ(k) mean that on average a sample x_i is very similar in value to a
sample x_j, where k = j - i. So on average it is possible to estimate the value x_j from
the preceding value x_i.

Fig. 8: Linear prediction of signals

a)  Principle 

b)  Recursive  prediction  scheme 


The estimated value of the sample x(nT) in Fig. 8, where T is the sampling period, is

$$\hat{x}(nT) = \sum_{i} a_i\, x\bigl((n-i)T\bigr) \qquad (4)$$

The a_i are called the predictor coefficients, and they are computed from the
autocorrelation function by minimization of the prediction error between the estimated
and the real value of the signal:

$$\overline{\bigl(x(nT) - \hat{x}(nT)\bigr)^2} = \min \qquad (5)$$

Eq. 5 leads to an algorithm for calculating the predictor coefficients from a set of
linear equations. We can write this again in vector form

$$\mathbf{M}\,\mathbf{a} = \mathbf{s} \qquad (6)$$

where M is a matrix consisting of all the averaged products x(n-i)·x(n-j), a is the
vector of the predictor coefficients a_i and s is the vector of the correlation
coefficients x(n)·x(n-i). The scheme of such a prediction system in Fig. 8b shows that
the estimated signal x̂ has to be subtracted from the original signal. The prediction
error ε is then minimal if the predictor coefficients are well adapted to the original
signal.
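
As an illustration of how the linear equations M a = s can be solved, the following sketch
uses the Levinson-Durbin recursion on autocorrelation values. The text does not say which
solver was used, so this choice, the function names and the synthetic test signal (a
damped resonance) are assumptions made for the example.

```python
import math

def autocorrelation(x, max_lag):
    """Averaged products of the signal with itself for lags 0..max_lag."""
    n = len(x)
    return [sum(x[m] * x[m + k] for m in range(n - k)) / n for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a_1..a_order.

    The predictor is x_hat(n) = sum_i a_i * x(n - i). Returns the coefficients
    and the remaining mean squared prediction error.
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error              # reflection coefficient of stage i
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error

def predictor_coefficients(x, order):
    return levinson_durbin(autocorrelation(x, order), order)

# Damped resonance that obeys x(n) = 2 r cos(w) x(n-1) - r^2 x(n-2) exactly.
R_DECAY, OMEGA = 0.9, 0.3
signal = [R_DECAY ** n * math.sin(OMEGA * n) for n in range(400)]
coeffs, residual_error = predictor_coefficients(signal, 2)
```

For this test signal the two estimated coefficients should come out very close to
2·0.9·cos(0.3) and -0.81, the values of the recursion that generated it.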

In Fig. 9 the original speech signal and the resulting error signal ε are shown. It can
be seen that the error is maximal when the excitation pulse starts a new pitch period.
At this moment the free oscillation of air in the vocal tract is interrupted and the
prediction fails.


Fig. 9: Speech signal and error in linear prediction.

Fig.  10:  Inverse  filtering  of  a  speech  spectrum. 


In the spectral domain, Fig. 10 interprets linear prediction as a process of inverse
filtering. The transfer characteristic of the predictor filter is adapted, in a
least-squares sense, to the envelope of the speech spectrum. The line structure of such a
voiced speech signal can only be reconstructed by predictors with very many coefficients,
at least 100. For the reconstruction of the spectral envelope as in Fig. 10 we need about
10 to 14 coefficients. A different sort of predictor for such a periodic structure would
be a comb filter. That is a predictor with only a few coefficients, but the delay between
the speech samples used is equal to the periodicity predicted. Because the periodicity of
the human voice changes relatively fast, the delay time of such a comb-filter predictor
must be controlled adaptively, and it is therefore necessary to know the exact value of
the pitch period.


2.4 Pitch Analysis /6/

The algorithms for pitch analysis described here are meant only as representatives of the
more complex signal processing techniques which are called feature extraction techniques
in Fig. 3. Such algorithms are often based not only on strict mathematical operations but
also on some empirically defined rules. Fig. 11 shows two examples of preprocessing the
speech signal for pitch analysis. The first is the Autocorrelation Function (ACF) already
treated in chap. 2.2, and the second is the Average Magnitude Difference Function (AMDF),
which is a sort of simplified autocorrelation avoiding the multiplication:

$$\mathrm{AMDF}(k) = \sum_{m} \bigl| x(m) - x(m+k) \bigr| \qquad (7)$$

This equation is quite similar to equation (3). The most important difference is that the
ACF has a maximum at its best periodicity value k, whereas the AMDF has a minimum at this
point (besides the fact that the AMDF has only non-negative values).
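
A minimal sketch of pitch detection with the AMDF of Eq. 7 is given below. The test
segment, the search range and the division by the number of terms (which removes the bias
towards large k that a finite segment would otherwise cause) are assumptions added here
for illustration.

```python
import math

def amdf(x, k):
    """Average magnitude difference between the segment and its copy delayed by k."""
    n = len(x)
    return sum(abs(x[m] - x[m + k]) for m in range(n - k)) / (n - k)

def pitch_period_amdf(x, k_min, k_max):
    """Return the lag in [k_min, k_max] at which the AMDF is smallest."""
    return min(range(k_min, k_max + 1), key=lambda k: amdf(x, k))

# Synthetic "voiced" segment with a period of 80 samples.
PERIOD = 80
segment = [
    math.sin(2 * math.pi * m / PERIOD) + 0.5 * math.sin(4 * math.pi * m / PERIOD)
    for m in range(400)
]

estimated_period = pitch_period_amdf(segment, 20, 150)
```

Only subtractions and absolute values are needed, which is the simplification over the
autocorrelation function mentioned in the text.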

Fig.  11  shows  different  examples  of  voiced  speech  signals  and  their  resulting  ACF  and 
AMDF.  Both  functions  are  only  computed  for  values  around  the  expected  pitch  period,  not 
for  very  small  values  of  k  and  not  for  very  large  ones.  Small  values  of  k  correspond  to 

Fig. 11: Periodicity analysis based on Autocorrelation Function ACF or Average
Magnitude Difference Function AMDF.

In the case of autocorrelation analysis the first peak is easily detected and differs
sufficiently from the next peaks, which correspond to other frequencies but not to the
real pitch. The AMDF does not show such a clear distinction between the first and the
second minimum, so pitch errors occur more easily.

To avoid pitch errors, which in some speech coders can destroy speech quality, a logical
postprocessing stage is necessary. The basic principle is to use a probabilistic model
which can learn from the history of pitch contours of the particular speakers using the
system. In this way the search range for the maximum or minimum can be restricted, and the
frequent octave jumps which double or halve the original pitch can be avoided.
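
The idea of using the pitch history to suppress octave jumps can be sketched very simply.
The code below is not the probabilistic model described in the text; it substitutes a
running median of recent pitch periods as a much cruder stand-in, and the history length
and tolerance are arbitrary values chosen for the example.

```python
from statistics import median

def correct_octave_jumps(periods, history=5, tolerance=0.25):
    """Replace a doubled or halved pitch period by the value closest to recent history.

    periods   -- raw pitch periods (in samples), one per analysis frame
    history   -- number of previous corrected frames used as reference (assumed)
    tolerance -- accepted relative deviation from the reference (assumed)
    """
    corrected = []
    for period in periods:
        recent = corrected[-history:]
        if recent:
            reference = median(recent)
            candidates = [period, period * 2, period / 2]
            best = min(candidates, key=lambda c: abs(c - reference))
            if abs(best - reference) <= tolerance * reference:
                period = best
        corrected.append(period)
    return corrected

raw_track = [80, 81, 80, 160, 79, 40, 80]   # two octave errors: 160 and 40
smoothed_track = correct_octave_jumps(raw_track)
```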


Many additional processing stages are necessary if, e.g., the speech signal is distorted
or heavily band-limited. The principal strategy for detecting periodicities always uses a
sort of autocorrelation or one of its variants such as the AMDF.

3. SYSTEMS FOR DIGITAL SPEECH TRANSMISSION

Long-term research in digital speech has led to a multiplicity of different speech coding
techniques, all of which are based on the principal algorithms described in chapter 2 but
which of course have many special features. Most of these systems have only scientific
value. Therefore we will describe only those systems which have real practical
significance.

3.1 Pulse Code Modulation /2/

This is the most important and the oldest method of coding and transmitting speech
signals. The basic scheme in Fig. 12 is quite simple. The sampled signal is quantized as
already described in Fig. 4. The usual sampling rate is 8 kHz, corresponding to a voice
frequency bandwidth of 4 kHz, and the quantization is done with 8 bits/sample. So the
resulting data rate is 64 kb/s. That is quite a high rate, which cannot be transmitted
over normal telephone channels or over HF channels. The quantization in PCM is done in a
logarithmic manner: small signal values are quantized more precisely than larger values.
In this way the signal-to-noise ratio (SNR) remains constant at a level of about 38 dB
over a large dynamic range. This value is better than that of some degraded analog
telephone lines.

Fig. 12: Pulse code modulation.
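
Logarithmic quantization can be illustrated with a small sketch. The text only says that
the quantization is logarithmic; the mu-law characteristic with mu = 255 used below is
one common choice and is an assumption here, as are the function names and the mid-point
reconstruction rule.

```python
import math

MU = 255.0  # assumed companding constant (mu-law); not specified in the text

def bit_rate(sampling_rate_hz, bits_per_sample):
    return sampling_rate_hz * bits_per_sample

def compress(x):
    """Logarithmic compression of a sample in the range -1..+1."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def pcm_encode(x, bits=8):
    """Quantize the compressed sample uniformly to one of 2**bits codes."""
    levels = 2 ** bits
    code = int((compress(x) + 1.0) / 2.0 * levels)
    return min(levels - 1, code)

def pcm_decode(code, bits=8):
    """Reconstruct the sample from the mid-point of its quantization interval."""
    levels = 2 ** bits
    return expand((code + 0.5) / levels * 2.0 - 1.0)

if __name__ == "__main__":
    print(bit_rate(8000, 8))  # 64000 bits per second
    for sample in (0.01, 0.1, 0.5, 0.9):
        restored = pcm_decode(pcm_encode(sample))
        print(sample, restored, abs(restored - sample) / sample)
```

The relative error stays at a few percent for small and large samples alike, which is the
reason the signal-to-noise ratio is nearly independent of the signal level.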


3.2 Differential Pulse Code Modulation DPCM, Delta Modulation /1/

The principal scheme of DPCM is shown in Fig. 13. It is quite similar to Fig. 8 because
DPCM needs a predictor, which in the simplest version can be a delay of one sample. The
quantizer Q then has to quantize only the difference between consecutive samples. With
only slight degradation it is then possible to code speech signals at about 40 kb/s.
Every difference sample is then quantized with 5 bits, again in a logarithmic manner.

Fig. 13: Differential pulse code modulation DPCM or delta modulation.


If the data rate of 40 kb/s is too high, there are further possibilities to reduce the
amplitude of the error signal by using a better predictor. This error signal can be
quantized with 3 or 4 bits/sample. With a further reduction of speech quality, which could
only be accepted in commercial or military applications, some 2 bits/sample are still a
possible quantizer dimension.
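
The simplest DPCM scheme, with a one-sample delay as predictor, can be sketched as
follows. For brevity the sketch uses a uniform 5-bit quantizer instead of the logarithmic
one mentioned in the text; the step size, the test tone and the function names are
likewise assumptions made for the example.

```python
import math

BITS = 5
Q_MIN, Q_MAX = -(2 ** (BITS - 1)), 2 ** (BITS - 1) - 1   # -16 .. +15

def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the previous reconstruction."""
    codes = []
    prediction = 0.0                      # predictor: the last reconstructed sample
    for x in samples:
        q = round((x - prediction) / step)
        q = max(Q_MIN, min(Q_MAX, q))     # limit to the 5-bit code range
        codes.append(q)
        prediction += q * step            # track what the decoder will reconstruct
    return codes

def dpcm_decode(codes, step):
    """Accumulate the quantized differences to rebuild the signal."""
    out = []
    value = 0.0
    for q in codes:
        value += q * step
        out.append(value)
    return out

# 200 Hz tone sampled at an assumed 8 kHz.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
codes = dpcm_encode(tone, step=0.02)
decoded = dpcm_decode(codes, step=0.02)
max_error = max(abs(a - b) for a, b in zip(tone, decoded))
```

The encoder predicts from its own reconstruction rather than from the original samples,
so quantization errors do not accumulate in the decoder.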

The quality of such a DPCM system can be enhanced by adaptively controlling the predictor
coefficients as has been shown in Fig. 8. Such a system is then called Adaptive
Differential Pulse Code Modulation (ADPCM). The adaptive control can help to make a
16 kb/s system sound like a 24 kb/s system, but it cannot help to make a good-sounding
8 kb/s system.

A slightly different variation of these principles is delta modulation. The principal
scheme of this technique is identical to Fig. 13, but we now use only a 1-bit quantizer,
which makes the hardware very simple. Because coding with 1 bit/sample is not possible
with normal DPCM, we must use a much higher sampling rate. In this way the differences
between consecutive samples become much smaller and can be quantized with 1 bit. There is
still another problem: a 1-bit quantizer can only quantize two values, normally 0 and 1,
but for speech we need the values -1 and +1 as speech slopes go up and down. Therefore we
must leave out the value 0, and the quantizer jumps between -1 and +1. The waveforms of
such a delta modulator look like that in Fig. 14a. For very fast signal slopes the delta
modulator cannot follow with its fixed step size. The Continuously Variable Slope Delta
modulator (CVSD) now in use avoids this drawback by changing the step size of the
quantizer. This is in effect similar to changing the predictor parameters.
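
The difference between a fixed and a variable step size can be seen in a small sketch.
The adaptation rule below (enlarge the step when two successive bits are equal, reduce it
otherwise) is a simplified rule chosen for illustration; it is not the CVSD algorithm
itself, and the growth and shrink factors are arbitrary assumptions.

```python
def delta_modulate(samples, step, adaptive=False, grow=1.5, shrink=0.5):
    """1-bit coding: each bit says whether the estimate must move up (+1) or down (-1).

    Returns the bit sequence and the staircase approximation of the input.
    """
    bits, approximation = [], []
    estimate = 0.0
    base_step = step
    previous_bit = None
    for x in samples:
        bit = 1 if x >= estimate else -1
        if adaptive and previous_bit is not None:
            if bit == previous_bit:
                step *= grow                          # slope overload: larger steps
            else:
                step = max(base_step, step * shrink)  # near the target: smaller steps
        estimate += bit * step
        bits.append(bit)
        approximation.append(estimate)
        previous_bit = bit
    return bits, approximation

# A sudden jump to 1.0 is a very steep slope for a step size of 0.05.
jump = [1.0] * 30
linear_bits, linear_out = delta_modulate(jump, 0.05)
adaptive_bits, adaptive_out = delta_modulate(jump, 0.05, adaptive=True)
```

After ten samples the fixed-step modulator has covered only half of the jump, while the
adaptive one has already reached the new level and is oscillating closely around it.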

Fig. 14: Delta modulation signals

a)  linear  deltamodulation  of  analog  signals 

b)  linear  and  adaptive  deltamodulation 


Fig. 14b shows that such an adaptation can have a faster impulse response than normal
linear delta modulation. In recent years much more sophisticated methods have been
developed to code the error signal in an adaptive way. This means coding and recreating a
differential signal like that in Fig. 9 while transmitting only very few parameters. All
the methods used are similar in principle: the error signal consists of periodic peaks,
and in between there is some signal which looks like noise but is not only noise.
Therefore a spectral analysis of this error or residual signal is done, the most important
spectral components are coded and transmitted, and at the receiver the residual can be
reconstructed. The most important task is to preserve the periodicity structure as in
Fig. 10. The basic scheme of such a system is shown in Fig. 15. Because a synthetic error
signal is now constructed, this can be done using some information from the original
signal too. Therefore the analysis can be done with information from the original and the
error signal /8, 11/.

Fig. 15: Baseband or residual coder


3.3 Transform Coding /9, 10/


In contrast to predictive coding, which operates in the time domain, transform coding
does the important operations in the frequency domain. Fig. 16 shows the basic scheme. A
set of samples x is transformed into a spectral domain. Besides the well-known Fourier
transform, a much simpler but equally efficient transform has been introduced, the
Discrete Cosine Transform (DCT). The basis vectors of this transform have some similarity
with the cosine functions of the Fourier transform but are quite different in their exact
shape. These "cosine" functions have much similarity with speech signals, and so a
representation of speech with these functions is very efficient.
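
A block transform of this kind can be sketched in a few lines. The text does not specify
which variant of the DCT is meant; the orthonormal DCT-II and its inverse are used below
as an assumption, computed directly from the definition rather than by a fast algorithm,
and the block length of 8 samples is chosen only to keep the example small.

```python
import math

def _scale(k, n):
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(block):
    """Orthonormal DCT-II of one block of samples (direct evaluation)."""
    n = len(block)
    return [
        _scale(k, n) * sum(block[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                           for i in range(n))
        for k in range(n)
    ]

def idct(coeffs):
    """Inverse transform: rebuilds the block from its DCT coefficients."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * coeffs[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n))
        for i in range(n)
    ]

def keep_largest(coeffs, count):
    """Crude stand-in for coding: keep only the `count` largest coefficients."""
    keep = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)[:count]
    return [c if k in keep else 0.0 for k, c in enumerate(coeffs)]

block = [0.1 * i for i in range(8)]          # slowly varying block
spectrum = dct(block)
rebuilt = idct(spectrum)
approximated = idct(keep_largest(spectrum, 2))
```

For a slowly varying block almost all of the energy falls into the first few
coefficients, so discarding the rest changes the reconstructed block only slightly.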

The sampled speech signal vector x has to be transformed blockwise into the vector y,
which must then be coded in a very sophisticated manner and transmitted. Within the
receiver both operations are done in reverse to reproduce a signal as natural as possible.
To choose a good block length we apply the same considerations that were already
important for predictive coding. A block should not be much longer than the stationary
phase of a speech sound. For a normal articulation rate this would be about 30 ms.

Fig. 17: Estimation of the basis spectrum of a speech block.

a)  Long-term  averaged  speech  spectrum 

b)  Actual  block  spectrum  and  averaged  spectrum 

c)  Estimated  basis  spectrum  of  the  block 

The  side  information  additionally  necessary  has  to  be  transmitted  over  a  special 
channel.  Of  course  it  is  also  possible  to  make  a  much  more  sophisticated  preanalysis  of 
the  spectrum  to  minimize  the  number  of  spectral  lines  which  really  must  be  coded.  Here 
e.g.  information  about  the  periodicity  can  be  included  again. 

4.  ANALYSIS  SYNTHESIS  TELEPHONY 


Fig. 1 has given a principal scheme of how the speech signal is produced by the human
vocal apparatus. All the operations necessary can be done with digital signal processing
too. The excitation function is separated into an impulse and a noise function. These
produce voiced or voiceless sounds. The three main resonance systems, throat cavity, mouth
cavity and nose cavity, are rather complex mechanical filter systems. There is no
principal problem in realizing such filters by electronic means. Thus we can build an
electronic speech synthesizer, but we need to compute the signals for controlling all the
parameters which are necessary to produce a natural-sounding and highly intelligible
speech signal. These are the pitch frequency to control the pulse frequency of the impulse
generator and information about the position of the voiced/unvoiced switch. The control
parameters for the articulation cavities can be combined into a unified filter whose
transfer characteristic can be handled in a very flexible manner. The difficulty therefore
is not to realize the synthesizer but to get good control parameters and to compute them
in real time. Then we can construct an analysis-synthesis system for speech transmission,
a vocoder.

Fig. 18: Linear predictive coder LPC.

Both of these generators are alternately switched to the synthesizer filter by a
voiced/unvoiced control signal. This excitation signal works like the error signal in an
ADPCM coder but is an entirely synthetic signal. The synthesizer filter is a recursive
predictor filter whose transfer characteristic can be controlled like that shown in
Fig. 10. Resonances can be produced which shape the flat spectral envelope of the
excitation signal's spectrum, so the resulting speech spectrum has the usual formants.
The last stage in synthesis adaptively controls the speech energy before the
digital-to-analog converter recreates the analog signal.
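
The synthesis part can be illustrated by a minimal sketch: an impulse or noise excitation
is selected by the voiced/unvoiced decision and passed through a recursive filter
controlled by the predictor coefficients. The function names, the uniform noise source and
the two-coefficient example filter are assumptions made here; a real vocoder would update
all parameters frame by frame.

```python
import math
import random

def excitation(n_samples, voiced, pitch_period, seed=0):
    """Impulse train for voiced sounds, noise for unvoiced sounds."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def synthesize(excitation_signal, coeffs, gain=1.0):
    """Recursive filter: y(n) = gain * e(n) + sum_i a_i * y(n - i)."""
    out = []
    for n, e in enumerate(excitation_signal):
        y = gain * e
        for i, a in enumerate(coeffs, start=1):
            if n - i >= 0:
                y += a * out[n - i]
        out.append(y)
    return out

# One resonance ("formant") defined by two predictor coefficients.
R_DECAY, OMEGA = 0.9, 0.3
coeffs = [2 * R_DECAY * math.cos(OMEGA), -R_DECAY ** 2]

voiced_frame = synthesize(excitation(160, True, 80), coeffs)
unvoiced_frame = synthesize(excitation(160, False, 80), coeffs)
```

Each excitation impulse starts a new damped oscillation of the filter, which corresponds
to the start of a new pitch period in natural voiced speech.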

To realize such an LPC vocoder with a universal signal processor requires very fast
digital technology, because every second some hundred thousand multiplications and
additions are necessary for the analysis part and the synthesis part. All these
calculations have to be done with at least 16-bit accuracy. A first model of such a
vocoder is shown in Fig. 19. It can transmit speech at a bit rate of 2400 b/s, a data rate
that can be transmitted over practically all communication channels existing today.
Vocoders are in military use for encrypting the digital bit stream. Analog speech signals
cannot be encrypted; they can only be scrambled, a technique by which secure voice
transmission is not possible. Because the speech quality of such LPC vocoders is quite
good, they will receive wide acceptance in the next years for commercial and military use.


Fig. 19: Hardware realization of an LPC vocoder terminal.

5. SPEECH OUTPUT SYSTEMS /13/

The LPC vocoder has shown that it is possible to produce high-quality speech signals by
electronic means. Therefore vocoders became important not only for speech transmission
but also for speech output from computers and for voice messaging, where speech signals
have to be stored and reproduced on demand. The simplest systems for speech output are
announcement systems. In recent years very flexible inquiry systems have been introduced
where people can get information via telephone, e.g. about railway or airline departures.
The speech signals which have to be produced in such a system can be based on prestored
words or sentences, or the system can create entirely new speech signals from basic
knowledge about speech production.

The first sort of speech output systems are half-synthetic /14/. Such systems consist of
the blocks shown in Fig. 20.


The text which should be spoken first has to be analyzed. This is done not directly with
the speech analysis system but by a text analyzer. Words and phrases which should be
combined from the stored segments have to be identified, and the combinatorial rules
which define the later necessary control of prosodic parameters must be fixed. Then the
vocabulary has to be spoken and analyzed for its LPC parameters. These parameters can be
stored, and the quality of the speech can be tested directly by synthesizing the intended
speech signals. The combinations in particular have to be tested to verify the
naturalness of rhythm and melody in the final system. The speech synthesizer today is
always an LPC synthesizer. The formerly used channel vocoders or analog speech
concatenators cannot produce high-quality speech.

The editing stage need not be connected directly to the system as in Fig. 20, but
sometimes it is practical to allow the user to change the vocabulary, in which case it is
necessary to be able to integrate new words into the system.

stems  are  often  used  as  public  announcement  sys-
ement  like  airport  information  systems  for  air- 
nsmitted  via  telephone  or  radio  channels.  Some- 
aneously  many  output  channels  for  many  customers 
n  a  system  like  that  in  Fig.  21  can  be  helpful, 
rge  data  base  and  a  multiplexer  and  control  unit 
nded  announcements  for  the  special  customer.  The 
tion  he  wishes  through  the  telephone  dial  or  use 
etic  systems  of  course  have  a  limited  vocabulary 
rt .  Therefore  the  ultimate  speech  output  systems 
system  /15/.

Fig. 21: Multiplex speech output.

The basic structure of a text-to-speech system is shown in Fig. 22. The text input first
has to be segmented into its basic elements. This of course is very language dependent.
In German, e.g., there are a large number of compound words. In English, words are only
concatenated to build a composite unit. For English it could be sufficient to use a large
vocabulary with the phonetic transcription of every word, including information about the
prosodic parameters like stress or melody; in German this is not possible because stress
changes depending on the word combinations. From this linguistic-phonetic processor we
obtain a precise description of the articulatory parameters. A human speaker knowing all
these conventions should be able to speak the text perfectly even if he did not know the
language.


Fig. 22: Principle of text-to-speech synthesis


The  next  stage  in  the  processing  knows  all  the  rules  for  articulati 
citly  known  to  the  human  speaker  by  a  long  term  use  of  his  arti 
This  stage  knows  how  a  sound  changes  if  a  transition  from  this  sound  t
be  made.  With  all  this  knowledge  this  stage  calculates  the  parame 
final  speech  synthesizer  which  is  again  an  LPC  synthesizer,  con 
voiced /unvoiced  and  LPC  coefficients.  A  text-to-speech  system  giv 
vocabulary.  The  most  serious  drawback  is  that  it  can  only  be  used  for 
this  is  a  common  problem  in  speech  output.  In  half-synthetic  output 
sible  to  concatenate  or  store  very  flexibly  different  languages  but 
easily  possible  to  change  this  vocabulary.  In  the  text-to-speech  sy 
cabulary  changes  are  easy  but  language  changes  are  not  possible  i 
multilingual  by  its  construction  /17/.  Here  much  research  work  has
develop  a  really  well  sounding  multilingual  system.

6. FUTURE SYSTEM ASPECTS


In commercial telecommunication systems, speech coding and transmission will be
integrated more and more into speech recognition and synthesis systems. This enables not
only normal man-to-man communication but also a flexible integration of data input and
output from EDP systems. Fig. 23, a simplified version of Fig. 3 with emphasis on the
telecommunication and data processing aspect, makes this clearer. One very important
aspect is that ideas for total voice systems must integrate transmission techniques.

Fig. 23: Speech processing and telecommunication.


Speech coding for transmission has prepared many of the important parameter and feature
processing techniques necessary to recognize and synthesize speech signals. In the future
too, speech coding will bring deeper knowledge about the important characteristics of
speech signals, because human judgement of speech quality is always very critical. Speech
analysis and synthesis can learn from that.

Another important aspect is that speech coding techniques have prepared efficient digital
processing systems working in real time. The same or slightly modified processors can be
used in recognition and synthesis of speech. So both techniques can learn and profit from
each other to promote the total voice system.

SUMMARY


This  lecture  is  intended  to  provide  a  brief  insight  into  some  of  the
algorithms  that  lie  behind  current  automatic  speech  recognition  systems.  It
is  noted  that  early  phonetically  based  approaches  were  not  particularly 
successful,  due  mainly  to  a  lack  of  appreciation  of  the  problems  involved. 
These  problems  are  summarised,  and  various  recognition  techniques  are  reviewed 
in  the  context  of  the  solutions  that  they  provide.  It  is  pointed  out  that  the 
majority  of  currently  available  speech  recognition  equipments  employ  a 
'whole-word'  pattern  matching  approach  which,  although  relatively  simple,  has 
proved  to  be  particularly  successful  in  its  ability  to  recognise  speech.  It 
is  shown  how  the  concept  of  'time-normalisation'  plays  a  central  role  in  this 
type  of  recognition  process  and  a  family  of  such  algorithms  is  described  in 
detail.  In  particular,  it  is  shown  how  the  technique  of  'dynamic  time 
warping'  is  not  only  capable  of  providing  good  performance  for  isolated  word 
recognition,  but  how  it  may  also  be  extended  to  the  recognition  of  connected 
speech  (thereby  removing  one  of  the  most  severe  limitations  of  early  speech 
recognition  equipment).  It  is  also  demonstrated  how  word  sequence  information
can  be  used  to  increase  the  performance  of  both  isolated  and  connected  word 
recognisers.  Finally,  a  pair  of  techniques  are  presented  which  address  the 
specific  problems  faced  by  systems  which  are  to  be  used  by  more  than  one 
speaker,  or  in  noisy  environments.  It  is  concluded  that,  although  current
speech  recognition  algorithms  are  still  relatively  unsophisticated,  they 
nevertheless  exhibit  a  level  of  performance  which  can  be  useful  in  a  wide 
range  of  well  constrained  task  environments. 


INTRODUCTION 


It  is  now  thirty-one  years  since  the  first  paper  to  describe  a  technique  for 
recognising  spoken  words  was  published  [1].  Since  that  time,  many  different 
techniques  have  been  proposed,  ranging  from  the  ridiculously  simple,  to  the 
dreadfully  complicated.  In  the  early  years,  researchers  typically  followed  a
traditional  pattern  recognition  approach,  believing  that  speech  was  a  highly 
redundant  signal  containing  a  sequence  of  invariant  information  bearing 
elements  called  phonemes.  The  classical  early  speech  recogniser  thus  took  the 
form  of  a  pre-processor,  to  selectively  reduce  the  amount  of  data  present,  a 
feature  extractor,  typically  to  identify  formant  peaks,  a  segmentor,  to  divide 
the  signal  into  phonemic  segments,  and  a  classifier,  to  recognise  the 
individual  phonemes  from  their  features  (see  figure  1).  Discovering  which 
word  was  spoken  was  then  simply  a  matter  of  looking  up  the  sequence  of 
recognised  phonemes  in  a  kind  of  dictionary. 


Figure  1:  Typical  structure  of  an  early  automatic  speech  recogniser. 


Schemes  of  this  type  abounded  in  the  fifties  and  sixties  but,  for  reasons 
which  should  be  apparent  from  Dr.  Hunt's  lecture  on  'The  Speech  Signal',  they 
were  all  doomed  to  failure. 

The  reasons  why  automatic  speech  recognition  is  not  such  a  straightforward
endeavour  as  one  might  imagine  may  be  summarised  under  four  main  problem  areas: 

First,  the  speech  signal  is  normally  continuous,  that  is  to  say,  there  are  no 
pauses  between  the  words  in  a  spoken  sentence,  nor  are  there  any  other
acoustic  markers  which  identify  where  the  word  boundaries  might  be.  For 
example,  figure  2  shows  a  speech  spectrogram  of  the  phrase  "we  were  away  a 
year  ago";  the  only  pause  in  this  sentence  is  the  middle  of  the  "g"  in  "ago"!
Consequently,  techniques  for  recognising  speech  automatically  must  be  somehow 
able  to  spot  words  embedded  within  a  surrounding  sentence. 

Figure  2:  Speech  spectrogram  of  the  phrase  "we  were  away  a  year  ago".


Second,  speech  signals  are  highly  variable.  One  person's  voice  is  quite 
different  to  another's  due  to  differences  in  age,  sex  or  accent.  Even  for  a 
given  speaker,  his  voice  will  be  different  on  different  occasions;  sometimes 
he  will  speak  loudly,  sometimes  softly,  sometimes  a  whisper,  or  he  might  speak 
fast  or  slow,  or  he  might  even  have  a  cold  or  be  tense.  All  these  factors, 
and  more,  may  affect  a  person's  voice.  In  fact,  even  if  a  person  tries  very 
hard,  it  is  virtually  impossible  for  him  to  say  the  same  word  in  exactly  the 
same  way  on  two  different  occasions.  For  example,  figure  3  shows  the  word 
"helicopter"  spoken  three  times  by  the  same  speaker;  note  how  the  patterns  are 
similar,  but  not  identical.  Also,  since  speech  is  continuous,  adjacent  words 
affect  each  other  to  the  extent  that  their  beginnings  and  ends  can  change 
quite  significantly.  For  example,  the  phrase  "bread  and  butter",  if  spoken
quickly,  may  become  "bread'n  butter",  or  "breb'm  butter"  or  even  "bre'm
butter"!  The  problems  of  variability  can  therefore  be  characterised  as  those
conditions  which  cause  speech  patterns  which  one  would  like  to  be  the  same  to 
in  fact  be  quite  different.  Consequently,  one  requires  techniques  which  are 
capable  of  dealing  with  patterns  which  are  similar,  but  not  identical. 

Figure  3:  Three  spectrograms  of  the  word  "helicopter".


The  third  problem  area  is  ambiguity.  This  is  characterised  by  those 
conditions  whereby  patterns  which  one  would  like  to  be  different,  end  up 
looking  the  same.  For  example,  there  is  no  acoustic  difference  between  "to", 
"two"  and  "too".  Similarly,  "grey  tape"  sounds  exactly  the  same  as  "great 
ape"!  The  implication  here  is  that  one  needs  techniques  which  are  able  to 
decide  on  the  identity  of  a  particular  word  after  first  taking  into  account 
the  identities  of  the  surrounding  words. 

The  fourth  problem  area  results  from  the  fact  that  the  speech  signal  is,  of 
course,  a  part  of  the  complex  system  of  human  language.  Consequently,  it  is 
often  the  intention  behind  a  message  that  is  more  important  than  the  message 
itself. That is, one might want a system to correctly understand a message,
rather than recognise each individual word accurately. For example, the most
useful answer to the question "Can you tell me the time?" is "10.15" not "Yes,
I can". Therefore, an advanced speech recogniser would be expected to
incorporate  techniques  which  would  enable  it  to  use  the  meanings  of  words  in 
order  to  interpret  what  has  been  said. 

This  lecture  is  going  to  concentrate  on  techniques  which  have  been  found  to  be 
particularly  successful  for  tackling  the  first  two  problem  areas,  namely 
continuity  and  variability.  Techniques  in  the  other  areas  are  the  subject  of 
current  research,  and  have  not  yet  found  their  way  into  commercial  products. 
The  techniques  which  will  be  presented  here  have  found  practical  use,  but  they 
too  are  the  subject  of  continuing  research.  We  are  still  a  long  way  from 
having  all  the  answers  to  any  of  the  problems  described  above. 


ISOLATED WORD RECOGNITION


In  order  to  make  automatic  speech  recognition  a  practical  reality,  it  is  first 
necessary  to  overcome  the  continuity  problem.  The  technique  for  solving  this 
is  very  simple;  tell  the  speaker  that  he  must  put  artificial  pauses  between 
his  words,  thereby  sacrificing  naturalness  in  favour  of  greatly  simplifying 
the  recognition  process.  Since  the  positions  of  the  words  in  such  a  sentence 
can  now  be  determined  fairly  easily,  it  is  then  just  a  question  of  recognising 
each word individually. This technique became known as 'isolated word
recognition', and machines that use the technique are called 'isolated word
recognisers'.

It  has  already  been  pointed  out  that  the  phonetic  approach  to  speech 
recognition  is  too  difficult  at  present,  hence  most  successful  techniques  for 
isolated  word  recognition  use  the  following  principle  to  recognise  the 
individual words: a word to be recognised is compared with a set of
pre-stored reference words (often called 'templates'), and whichever stored
word is found to be most similar to the unknown word determines the
recognition result. The scheme is referred to as 'whole word pattern
matching'. Figure 4 illustrates the idea: a pre-processor turns the speech
waveform into some other useful representation (such as a sequence of spectra,
or LPC coefficients), a segmentor isolates each word by using the silences
between them (a technique known as 'endpoint detection' [2]), and then a
comparison module compares the unknown words with each of the templates, and
outputs the results. Before anyone can use such a recogniser, it first has
to  be  given  the  reference  templates,  and  this  process  is  known  as  'training 
the  machine'.  Each  word  is  spoken  in  turn,  passing  through  the  pre-processor 
and  the  segmentor  in  the  same  way  as  for  recognition,  and  then  the  individual 
reference  word  patterns  are  stored  away  inside  the  machine. 
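
The segmentation step can be illustrated with a minimal Python sketch. It
assumes that the pre-processor supplies one energy value per frame and that a
word is present wherever the energy stays above a fixed threshold. The function
name, the threshold and the minimum word length are illustrative assumptions,
and the endpoint detector cited as [2] is more elaborate than this.

```python
def find_words(energies, threshold, min_frames=2):
    """Return (start, end) frame indices (end exclusive) of each run of
    frames whose energy exceeds the threshold.

    Runs shorter than min_frames are treated as noise and discarded.
    """
    segments = []
    start = None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i                      # speech begins
        elif energy <= threshold and start is not None:
            if i - start >= min_frames:    # speech ends; keep long runs only
                segments.append((start, i))
            start = None
    # Handle an utterance that is still active at the end of the input.
    if start is not None and len(energies) - start >= min_frames:
        segments.append((start, len(energies)))
    return segments


# Two words separated by silence; the single frame of energy 1 is ignored.
frame_energies = [0, 0, 5, 6, 7, 0, 0, 1, 0, 8, 9, 0]
print(find_words(frame_energies, threshold=2))
```

A fixed threshold only works when the pauses between words are clearly quieter
than the speech, which is exactly what the isolated word discipline is meant
to ensure.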


Figure 4: Structure of a typical isolated word recogniser.


Such  a  machine  will  only  work  if  the  pattern  for  a  word  to  be  recognised  is 
sufficiently  similar  to  the  reference  pattern  for  the  same  word  inside  the 
machine.  However,  it  has  already  been  pointed  out  that  the  variability  in 
speech  is  such  that  this  might  not  be  the  case.  Hence  to  overcome  a  major 
variability  problem,  the  differences  between  speakers,  it  is  usual  for  such 
recognisers  to  be  trained  on  a  single  speaker;  the  person  who  intends  to  use 
the  machine.  For  the  same  reasons,  performance  is  best  if  the  user  trains  the 
machine  immediately  before  he  intends  to  use  it.  Such  systems  are  referred  to 
as  being  'speaker-dependent'. 

The  key  to  the  success  of  this  simple  approach  to  speech  recognition  lies  in 
the  comparison  process.  It  should  already  be  obvious  that  an  absolute 
comparison  cannot  be  used,  but  that  some  sort  of  correlation  process  is 
required.  However,  even  this  is  not  sufficient  since,  having  eliminated 
speaker  variability  by  using  only  one  speaker,  the  major  outstanding  source  of 
variability  is  that  the  same  word  is  very  rarely  the  same  length  on  different 
occasions.  For  example,  in  figure  3  it  can  be  seen  that  the  three  versions  of 
the  word  "helicopter"  all  have  different  lengths.  Consequently,  the  patterns 
which  need  to  be  compared  may  be  different  sizes,  and  this  is  a  problem  for  a 
simple  correlation  technique. 

The solution, therefore, is to 'time-normalise' each word such that all words
have the same length. In practice, the timescale of a particular word is
treated as if it were made of rubber, and the pattern is stretched or
compressed to the standard length. In the simplest schemes this is done
uniformly along the length of the pattern such that if a word has to be
doubled in length, each part of the word is doubled. Hence this technique is
known as 'linear time-normalisation'.
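
The uniform stretching described above can be sketched in a few lines of
Python. The sketch assumes a pattern is a list of frames and simply repeats or
drops whole frames (nearest-frame resampling); a real pre-processor might
interpolate between frames instead. The function name is an illustrative
assumption.

```python
def linear_time_normalise(pattern, target_len):
    """Stretch or compress a list of frames to exactly target_len frames.

    Output frame i is taken from input position i * n / target_len, so the
    timescale is distorted uniformly along the whole pattern.
    """
    n = len(pattern)
    if n == 0 or target_len <= 0:
        raise ValueError("pattern and target length must be non-empty")
    return [pattern[(i * n) // target_len] for i in range(target_len)]


word = [[1], [2], [3]]                   # three frames, one parameter each
print(linear_time_normalise(word, 6))    # every frame appears twice
```

Doubling the length repeats every frame once, which matches the statement
that each part of the word is doubled.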

Figure  5  illustrates  the  process  on  a  pair  of  utterances  of  the  word 
"helicopter".  The  two  original  patterns  are  shown  at  right  angles  to  each 
other  so  that  the  two  timescales  can  be  compared.  It  is  clear  that  the 
vertical  utterance  is  much  longer  than  the  horizontal  one.  The  rectangle  on 
the  right  is  prescribed  by  the  lengths  of  the  two  words,  and  the  diagonal  line 
is  the  linear  time  normalisation  relationship  between  the  two.  The  third 
pattern  is  the  result  of  stretching  the  horizontal  one  to  the  same  length  as 
the  vertical  one.  It  can  be  seen  that  the  two  vertical  patterns  are  more 
similar  than  the  two  original  patterns,  hence  the  usefulness  of  the  technique. 

Figure 5: Demonstration of linear time-normalisation.


To  calculate  the  actual  similarity,  it  is  common  to  compare  each  of  the 
individual  frames,  or  in  this  case  spectra,  using  some  kind  of  distance 
calculation,  and  then  to  sum  these  over  the  entire  pattern.  So  for  two  speech 
patterns V (vertical) and H (horizontal) a similarity measure might take the
form:

\[ D = \sum_{i=1}^{I} \sum_{j=1}^{J} \left( V(i,j) - H'(i,j) \right)^2 \]

where H' is the time-normalised version of H, I is the length of V and
H' in frames, J is the number of parameters in each frame, and D is the
distance between the two patterns. If D is zero, then the time-normalised
patterns are identical. Typically, I might be chosen such that the normalised
patterns are 1/2 second long, and J, for a filter bank pre-processor, might be
somewhere between 8 and 20 channels. For LPC parameters it would be usual to
employ the Itakura metric in place of the sum of squares to calculate the
distance between frames [3].
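
As a minimal sketch of how this distance drives recognition, the Python below
computes D for two patterns that have already been time-normalised to the same
number of frames, and then picks the template with the smallest D. The
function names and the toy templates are illustrative assumptions.

```python
def pattern_distance(v, h_norm):
    """Sum of squared differences over all frames i and parameters j."""
    if len(v) != len(h_norm):
        raise ValueError("patterns must be time-normalised to equal length")
    return sum((a - b) ** 2
               for frame_v, frame_h in zip(v, h_norm)
               for a, b in zip(frame_v, frame_h))


def recognise(unknown, templates):
    """Return the vocabulary word whose template is closest to the unknown."""
    return min(templates, key=lambda w: pattern_distance(templates[w], unknown))


# Toy templates: two frames of two parameters each (made-up numbers).
templates = {"one": [[0, 0], [1, 1]], "two": [[5, 5], [6, 6]]}
unknown = [[5, 4], [6, 7]]
print(recognise(unknown, templates), pattern_distance(templates["two"], unknown))
```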

A number of commercial isolated-word recognisers have been produced which
incorporate these techniques. However, individual machines will not be
reviewed here; the intention is merely to give an overview of the principles
involved.

The  performance  of  the  algorithm  can  be  quite  useful  if  the  number  of  words 
which  the  machine  has  to  distinguish  between  (the  'vocabulary')  is  kept  small, 
10 to 30 words for example. For the ten digits "zero" to "nine", one could
expect recognition accuracies up to about 97% under ideal conditions. The
actual performance obtained will depend, amongst other things, on the
consistency of the speakers, the exact nature of the pre-processing, and the
number of training examples allowed per word. This level of performance,
whilst not perfect, has proved sufficiently good to allow machines of this type
to  be  used  in  fairly  simple  applications,  examples  of  which  will  be  described 
in  later  lectures. 

For larger vocabularies, the recognition accuracy obtained using linear
time-normalisation can drop significantly, so low in fact that practical use
is out of the question. The reason for this is that linear normalisation is
not a very good model of what happens when people make words longer or
shorter. In practice, what actually happens is that some sounds are changed
more than others. For example, if you listen to yourself say the word "three",
first fast, and then slow, you can hear that the "-ee" changes length more than
the other sounds. This effect is apparent in figure 5. Although linear
time-normalisation has made the patterns the same length, it has still not
made them particularly similar to each other. By eye one can see that the
patterns have similar structures, and one can imagine that by distorting the
timescale of the horizontal utterance non-linearly, it could be made much more
like the vertical utterance. Figure 6 shows exactly this: the line in the
rectangle is no longer linear and the horizontal pattern is distorted
accordingly. This result was achieved by a person deciding which parts of the
word required lengthening and which parts needed shortening.

Figure 6: Demonstration of non-linear time-normalisation.


This  technique  is  known  as  'non-linear  time-normalisation',  and  it  can  be  seen 
that,  by  improving  the  comparison  process,  the  performance  of  an  isolated-word 
recogniser  may  be  raised. 


Of  course  it  is  necessary  to  find  the  non-linear  distortion  automatically, 
rather  than  by  hand  (as  in  figure  6)  and  this  presents  a  rather  difficult 
computational  problem.  Obviously,  there  are  many  millions  of  possible 
distortions,  that  is,  there  are  many  possible  lines  across  the  rectangle 
between  the  two  timescales.  However,  rather  than  search  all  the  possible 
distortions  in  turn  (potentially  a  very  time  consuming  process)  it  is  possible 
to apply the mathematical technique of 'dynamic programming'. Figure 7 shows
the result of using dynamic programming on this particular pair of words.
Note how similar the original vertical utterance is to the non-linearly
distorted version of the horizontal utterance. Since dynamic programming is
guaranteed to find the best possible distortion, this result is 'optimal
non-linear time-normalisation'.


Figure 7: Optimal non-linear time-normalisation using dynamic programming.

Optimal non-linear time-normalisation, or 'dynamic time warping' (DTW) as it
has become known, is still computationally quite expensive in comparison with
linear time-normalisation, but it is still the most efficient way of getting
the required answer, and the result is guaranteed to be the best. Once the
distortion has been made, then a distance between the two time-normalised
patterns may be calculated as described earlier. The actual technique of DTW
will  be  described  later  in  the  lecture. 

To  illustrate  how  the  technique  is  used  in  practice,  figure  8  shows  an  example 
of  isolated  word  recognition  using  dynamic  time  warping.  In  the  example  there 
are  three  reference  patterns,  the  digits  "one",  "two"  and  "three",  shown 
vertically.  The  horizontal  utterance  is  the  word  to  be  recognised,  actually  a 
"two".  The  unknown  word  is  compared  with  the  three  reference  patterns  using 
the  DTW  technique,  and  the  resulting  three  non-linear  time  distortions  are 
shown.  Also  shown  are  the  numbers  which  are  the  distances  between  the  unknown 
and  each  of  the  reference  patterns.  The  best  match  is  determined  by  the 
smallest  distance  (the  highest  similarity).  Hence  the  unknown  word  is 
recognised  correctly  as  "two". 


Figure  8:  Isolated  word  recognition  using  dynamic  time  warping. 


To  interpret  the  non-linear  distortions,  it  should  be  noted  that  when  matching 
two  words  which  are  the  same,  such  as  in  figure  7  and  the  correct  match  in 
figure  8,  the  distortions  tend  to  be  subtle  non-linear  variations  on  a  linear 
theme.  On  the  other  hand,  when  two  words  are  different,  such  as  the  two 
incorrect  matches  in  figure  8,  the  distortions  tend  to  be  grossly  non-linear. 
This  is  because  it  takes  a  very  severe  distortion  of  the  timescales  of  two 
different  words  to  make  them  even  remotely  similar. 

In  practice  it  is  possible  to  have  more  than  one  reference  pattern  per  word. 
This  enables  more  variability  in  pronunciation  to  be  captured  and  the 
performance  will  be  improved.  Similarly,  some  training  procedures  involve 
averaging different examples to obtain a suitable reference pattern. The Bell
Laboratories 'robust' training procedure is a hybrid of the two, combining
averaging with a statistical clustering procedure [4].
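
The averaging idea can be sketched as follows. This is only an illustration of
frame-by-frame averaging under the assumption that the training examples have
already been brought to the same length (for instance by time-normalisation);
it is not the 'robust' procedure of [4], and the function name is hypothetical.

```python
def average_patterns(examples):
    """Average several equal-length examples of a word, frame by frame."""
    n_frames = len(examples[0])
    if any(len(e) != n_frames for e in examples):
        raise ValueError("examples must all have the same number of frames")
    averaged = []
    for t in range(n_frames):
        frames_at_t = [e[t] for e in examples]
        # zip(*...) groups the values of each parameter channel together.
        averaged.append([sum(vals) / len(examples) for vals in zip(*frames_at_t)])
    return averaged


first = [[1, 2], [3, 4]]
second = [[3, 4], [5, 6]]
print(average_patterns([first, second]))
```

Keeping several separate reference patterns per word instead only requires the
recogniser to take the smallest distance over all patterns of each word.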

Since  the  dynamic  time  warping  technique  is  able  to  provide  a  far  more 
realistic compensation process than linear time-normalisation, the performance
of isolated word recognisers based on DTW is significantly better. Greater
variability (in length) can be accommodated, hence larger vocabularies are
possible. Typically, for the ten digits, one could expect recognition
accuracies greater than 99% (remembering that we are still talking about
speaker dependent isolated-word machines).


DYNAMIC TIME WARPING


As  has  already  been  stated,  dynamic  time  warping  is  based  around  the 
mathematical  technique  of  dynamic  programming  (DP).  This  technique  is  a 
sequential  optimisation  process  whereby  many  local  optimisation  decisions  are 
combined  in  order  to  find  a  globally  optimal  solution  to  a  problem.  The 
process,  as  it  applies  to  dynamic  time  warping,  can  be  readily  understood  with 
reference  to  a  few  diagrams. 

Dynamic  time  warping  is  essentially  a  two-stage  process.  Figure  9  illustrates 
the  first  stage.  Two  abstract  speech  patterns  are  shown,  one  vertically  and 
one  horizontally.  Each  pattern  has  time  frames  consisting  of  three  parameter 
channels,  the  vertical  pattern  has  four  frames,  the  horizontal  has  five.  The 
matrix  in  the  centre  is  known  as  the  'distance  matrix'  and  it  contains  numbers 
which  correspond  to  the  distances  between  each  frame  in  one  pattern  and  each 
frame in the other pattern. For example, the number "20" in the top right
hand corner indicates that the first frame of the vertical pattern is quite
different to the last frame of the horizontal pattern. Similarly, the "1" in
row-2 column-2 indicates that the second frames of each pattern are very
similar.  The  distances  are  actually  calculated  by  taking  the  sum  of  the 
squares  of  the  differences  in  each  parameter  channel  for  each  pair  of  frames. 


Figure  9:  Dynamic  time  warping:  distance  matrix. 


The creation of the distance matrix is thus the first stage. The second stage
is  to  find  a  'path'  through  the  distance  matrix  from  the  top  left  hand  corner 
to  the  bottom  right  hand  corner  which  has,  along  its  length,  the  minimum  sum 
of  distances.  This  path  is  the  required  non-linear  relationship  between  the 
two  timescales  for  these  patterns.  In  other  words,  the  basic  function  of 
dynamic  time  warping  is  to  find  the  least-cost  distortion  of  two  patterns  in 
order  to  make  them  look  like  each  other. 

The  procedure  for  finding  the  best  path  out  of  all  the  possible  paths  is  where 
the  dynamic  programming  comes  in,  and  it  involves  the  successive  application 
of  a  'local  decision  function'  to  the  distance  matrix  in  order  to  construct  a 
'cumulative  distance  matrix'.  Figure  10  illustrates  the  process. 


Figure 10: a) Local decision function, b) partially filled cumulative
distance matrix, c) completed cumulative distance matrix, and d) decision
matrix.


The  local  decision  function  is  shown  in  figure  10(a).  This  is  a  three  way 
decision  function  which  says  that  a  path  may  arrive  at  any  particular  point 
either vertically, diagonally or horizontally. So, for any point in the
cumulative  distance  matrix,  the  smallest  cost  of  getting  to  that  point  is  the 
minimum  of  the  costs  of  getting  to  the  three  previous  points.  However,  it  is 
also  necessary  to  take  into  account  the  cost  of  being  at  a  particular  point  in 
the  first  place,  and  that  is  the  number  in  the  corresponding  place  in  the 
distance  matrix  (figure  9). 


Figure  10(b)  shows  the  cumulative  distance  matrix  in  the  process  of  being 
filled  in.  The  "?"  indicates  the  point  being  considered,  and  the  three 
previous  points  are  highlighted.  The  cost  of  getting  to  the  point  is  the 
minimum  of  19,  8  or  13,  and  the  cost  of  being  at  that  point  is  11  (from  the 
distance  matrix).  Hence  the  number  entered  into  the  cumulative  distance 
matrix is 19 (8+11).

Figure  10(c)  shows  the  cumulative  distance  matrix  completely  filled  in.  The 
number in the bottom right hand corner is highlighted because this is the
overall  distance  between  the  two  patterns.  This  is  the  number  which  is  shown 
in  figure  8;  it  is  the  sum  of  distances  along  the  least-cost  path  through  the 
distance  matrix.  To  find  the  path  it  is  necessary  to  remember,  at  each  point 
in  the  calculation  of  the  cumulative  distance  matrix,  exactly  which  local 
decisions  were  made  (horizontal,  vertical  or  diagonal).  Figure  10(d)  shows 
all  of  these  decisions,  and  it  can  be  seen  that  they  form  a  tree  radiating 
from  the  top  left  hand  corner  (this  is  where  the  calculation  started).  The 
actual  minimum  cost  path  is  obtained  by  tracing  back  along  the  local  decisions 
starting  at  the  bottom  right  hand  corner. 

Referring  back  to  the  distance  matrix,  figure  9,  the  calculation  shows  that 
the least-cost path takes the route 7+1+5+12+2, and it can be seen that no
other  path  has  a  sum  lower  than  27. 

To  summarise,  the  formulation  for  the  distance  between  two  speech  patterns 
obtained  using  dynamic  time  warping  is  based  upon  the  following  recursive 
expression: 


\[ D(i,j) = \min\left[ D(i-1,j),\ D(i,j-1),\ D(i-1,j-1) \right] + \sum_{k=1}^{K} \left[ V(i,k) - H(j,k) \right]^2 \]

where 1 ≤ i ≤ I, and I is the number of frames in speech pattern V, 1 ≤ j ≤ J,
and J is the number of frames in H, and K is the number of parameters per
frame. The overall distance between the two patterns V and H is D(I,J).
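
The two stages and the trace-back can be put together in a short Python
sketch. It follows the recursion above with the three-way local decision
function and no further path constraints, uses 0-based indices instead of the
1-based indices of the formula, and treats a pattern as a list of frames. The
function name and the toy patterns are illustrative assumptions.

```python
def dtw(v, h):
    """Return (overall distance, least-cost path) between patterns v and h."""
    I, J = len(v), len(h)
    # Stage 1: distance matrix of squared differences between frame pairs.
    d = [[sum((a - b) ** 2 for a, b in zip(v[i], h[j])) for j in range(J)]
         for i in range(I)]
    # Stage 2: cumulative distance matrix plus the local decisions taken.
    D = [[0] * J for _ in range(I)]
    back = [[None] * J for _ in range(I)]
    for i in range(I):
        for j in range(J):
            candidates = []
            if i > 0:
                candidates.append((D[i - 1][j], (i - 1, j)))          # vertical
            if j > 0:
                candidates.append((D[i][j - 1], (i, j - 1)))          # horizontal
            if i > 0 and j > 0:
                candidates.append((D[i - 1][j - 1], (i - 1, j - 1)))  # diagonal
            if candidates:
                best_cost, back[i][j] = min(candidates)
            else:
                best_cost = 0          # top left hand corner: nothing before it
            D[i][j] = best_cost + d[i][j]
    # Trace back from the bottom right hand corner to recover the path.
    path = [(I - 1, J - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return D[I - 1][J - 1], path


v = [[0], [1], [2]]
h = [[0], [0], [1], [2]]       # same shape as v, but with a longer first sound
distance, path = dtw(v, h)
print(distance, path)
```

In this toy case the path steps horizontally once, absorbing the repeated
first frame of h, and the overall distance is zero, which a linear
normalisation of h could not achieve.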


CONNECTED  WORD  RECOGNITION 


The importance of DTW lies in two areas. First, recognition accuracy is much
greater  than  with  linear  time-normalisation.  Second,  it  has  in  fact  provided 
a  rather  neat  solution  to  the  continuity  problem.  It  has  turned  out  to  be 
possible  to  extend  the  technique  from  isolated  to  connected  words  using  a 
relatively  simple  modification  to  the  algorithm.  Conceptually,  the 
modification  can  be  understood  as  follows:-  In  the  isolated  word  situation, 
DTW  is  able  to  find  all  the  non-linear  temporal  relationships  (paths)  between 
the  unknown  pattern  and  the  reference  patterns.  Figure  8  shows  three  paths: 
the  best  path  (for  the  correct  match)  and  two  sub-optimal  paths.  The  best 
path  explains  the  relationship  between  the  unknown  word  and  one  of  the 
reference  patterns.  To  recognise  an  unknown  sequence  of  connected  words, 
therefore,  it  would  be  necessary  to  find  a  path  which  explains  the 
relationship  between  the  unknown  phrase  and  a  sequence  of  reference  patterns. 
In practice this proves to be fairly easy; it is merely necessary to allow
paths to jump from reference pattern to reference pattern whilst computing the
dynamic  time  warping.  The  trajectory  of  the  best  path  then  determines  the 
recognition  result. 
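
A minimal Python sketch of this idea is given below. It processes the unknown
phrase one frame at a time and keeps, for every frame of every reference
pattern, the cost of the best path so far together with the word sequence that
path has passed through. A path may enter the first frame of any reference
pattern from the end of the best pattern at the previous input frame. This is
an illustrative reading of the description above, not the published algorithm
of any of the cited papers; the names and toy templates are assumptions.

```python
def frame_dist(a, b):
    """Squared distance between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recognise_connected(unknown, templates):
    """Return (word sequence, distance) for the best path through templates."""
    if not unknown:
        raise ValueError("unknown phrase must contain at least one frame")
    words = list(templates)
    prev = None                  # prev[w][i] = (cost, word history so far)
    for frame in unknown:
        if prev is None:
            entry = (0.0, ())    # the first word starts at no prior cost
        else:
            # Best path that finished some reference pattern one frame ago.
            entry = min((prev[w][-1] for w in words), key=lambda c: c[0])
        cur = {}
        for w in words:
            column = []
            for i, ref_frame in enumerate(templates[w]):
                cands = []
                if i > 0:
                    cands.append(column[i - 1])           # vertical
                if prev is not None:
                    cands.append(prev[w][i])              # horizontal
                    if i > 0:
                        cands.append(prev[w][i - 1])      # diagonal
                if i == 0:
                    # Jump into this pattern from the end of another one.
                    # Listed last, so that ties favour staying in a word.
                    cands.append((entry[0], entry[1] + (w,)))
                cost, history = min(cands, key=lambda c: c[0])
                column.append((cost + frame_dist(ref_frame, frame), history))
            cur[w] = column
        prev = cur
    cost, history = min((prev[w][-1] for w in words), key=lambda c: c[0])
    return list(history), cost


# Toy one-parameter reference patterns (made-up numbers).
templates = {"one": [[1], [2]], "two": [[5], [6]], "three": [[9], [8]]}
phrase = [[1], [2], [1], [2], [5], [6], [1], [2], [9], [8]]
print(recognise_connected(phrase, templates))
```

The word sequence is read directly off the trajectory of the best path, so no
separate segmentation of the phrase into words is needed.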

Figure  11  illustrates  this  technique  quite  clearly.  The  reference  patterns 
are  the  same  words  as  in  figure  8,  but  this  time  the  unknown  pattern  consists 
of  a  sequence  of  words  (actually  "11213").  The  best  path,  determined  by  DTW, 
is  shown,  and  it  can  be  seen  to  be  jumping  around  from  reference  pattern  to 
reference  pattern.  The  trajectory  reveals  that  the  phrase  is  recognised 
correctly  as  "11213". 

There  are  a  number  of  variations  on  this  particular  technique  [5,6],  but  the 
very  simple,  yet  very  effective,  implementation  described  here  is  attributed 
to John Bridle [7].

The  technique  represents  a  new  and  exciting  development  in  automatic  speech 
recognition  since  the  restriction  of  using  isolated  words  may  be  removed.  As 
a  consequence  there  are  now  a  number  of  connected  word  recognisers  available 
commercially,  and  as  a  group  they  are  the  most  advanced  machines  around. 

The  potential  for  more  natural  communication  with  machines  is  obviously  higher 
with  connected  word  recognisers,  but  it  is  worth  remembering  that  they  are 
also  speaker  dependent,  and  perhaps  more  importantly,  they  do  not  take  into 
account  the  variations  which  may  occur  at  word  boundaries,  as  described 
earlier.  This  is  because  the  technique  assumes  that  a  connected  phrase 
consists  of  a  sequence  of  isolated  words  with  little  modification,  hence  such 
machines  will  not  be  able  to  recognise  the  "and"  in  "bre'm  butter". 


Figure  11:  Connected  word  recognition. 


Therefore,  in  order  to  achieve  good  recognition  accuracy  for  connected  speech, 
it  is  necessary  to  ask  the  operator  to  speak  as  clearly  as  possible,  and  not 
to  run  his  words  together  too  much.  It  is  also  common  to  train  such  a  machine 
using reference words which are spoken fairly abruptly (as in figure 11), since
otherwise there may be length differences which are too great for the DTW to
handle. Another scheme is to train on word sequences in order to include word
boundary modifications in the reference patterns [8]. This is usually a
bootstrapping procedure whereby the connected word recognition algorithm
itself is used to extract reference patterns from a carrier phrase using, in
the first instance, normal isolated references. This technique is known as
'embedded training'. Some other schemes use the same extraction principle,
but  train  on  each  possible  word  pair  sequence. 

Of course a connected word recogniser may be used to recognise isolated words,
and the performance is just the same as a DTW based isolated word recogniser.


SYNTAX 


A limiting factor on the performance of both isolated and connected word
recognisers  is  the  size  of  the  vocabulary  they  use.  In  general,  the  more 
words  there  are  in  the  vocabulary  the  worse  the  performance  will  be  (due  to 
variability).  Consequently  a  popular  technique  for  maintaining  high 
performance  with  large  vocabularies  is  to  exploit  the  fact  that  in  most  tasks 
not  every  word  can  follow  every  other  word.  In  other  words,  a  syntax  (a 
grammar)  may  be  used  to  limit  the  alternative  words  to  be  considered  by  a 
recogniser  at  each  point  in  a  sentence.  For  example,  in  a  sentence  such  as 
"hello  victor  tango  two  this  is  ...."  the  active  vocabulary  in  a  recogniser 
may be cut down to just the military alphabet in order to recognise the next
word.

There  are  a  number  of  ways  of  specifying  a  syntax,  but  the  most  popular  is  in 
the  form  of  a  state  transition  diagram.  Figure  12  illustrates  a  syntax  for  a 
voice  controlled  calculator.  It  can  be  seen  that  the  diagram  describes 
sentences  such  as  "what  is  two  plus  four  compute"  and  "put  nine  times  alpha 
into  beta  compute".  The  overall  vocabulary  size  is  23,  but  the  maximum  number 
of  words  that  need  to  be  considered  at  any  point  is  14,  and  in  some  places 
only one word is allowed. The average number of legal words is 8 and this is
known as the 'branching factor' of the syntax; the lower the branching factor
the higher the performance.
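
The quantities quoted above can be computed directly from a state transition
diagram. The Python sketch below uses a small hypothetical syntax (it is not
the syntax of figure 12) and assumes that the branching factor is the plain
average of the number of legal words over all states that have a successor;
other weightings are possible.

```python
# Hypothetical toy syntax: each state maps the words that are legal in that
# state to the state reached after the word has been spoken.
SYNTAX = {
    "start":    {"what": "is", "put": "operand"},
    "is":       {"is": "operand"},
    "operand":  {"one": "operator", "two": "operator", "alpha": "operator"},
    "operator": {"plus": "operand2", "times": "operand2"},
    "operand2": {"one": "end", "two": "end", "alpha": "end"},
    "end":      {"compute": "done"},
    "done":     {},
}


def vocabulary(syntax):
    """All distinct words that appear anywhere in the syntax."""
    return {word for transitions in syntax.values() for word in transitions}


def max_active_words(syntax):
    """Largest number of words that must be considered at any one point."""
    return max(len(transitions) for transitions in syntax.values())


def branching_factor(syntax):
    """Average number of legal words per non-final state."""
    counts = [len(t) for t in syntax.values() if t]
    return sum(counts) / len(counts)


def is_legal(syntax, words, start="start", final="done"):
    """Check whether a word sequence is a complete sentence of the syntax."""
    state = start
    for word in words:
        if word not in syntax[state]:
            return False
        state = syntax[state][word]
    return state == final


print(len(vocabulary(SYNTAX)), max_active_words(SYNTAX), branching_factor(SYNTAX))
print(is_legal(SYNTAX, "what is two plus one compute".split()))
```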


Figure 12: Syntax for a voice controlled calculator.


The  implementation  of  syntax  is  very  easy  for  isolated  word  recognisers;  most 
machines  have  facilities  for  specifying  which  reference  patterns  are  to  be 
considered  during  a  recognition  match.  For  connected  word  recognisers  the  DTW 
process  may  be  modified  on  the  basis  of  a  state  transition  diagram  to  only 
allow the path to jump between reference patterns if such a jump is legal in
the syntax. Hence, the syntax can be made an integral part of the
optimisation process and connected word recognisers with this facility are
able to find the best syntactically valid interpretation of a connected
utterance.

The  only  problem  using  syntax  with  a  speech  recogniser  is  that  the  user  has  to 
remember the allowable sequences of words. If he says a word which is
syntactically  illegal,  then  the  recogniser  may  be  forced  to  misrecognise  it, 
even  if  the  word  is  in  the  overall  vocabulary. 
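
Both the benefit and the hazard of syntax can be seen in a minimal Python
sketch of a syntax-constrained isolated word recogniser. To keep it short,
each word is assumed to be reduced to a single already-normalised feature
vector; the syntax, the templates and the function names are all hypothetical.

```python
def distance(a, b):
    """Squared distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recognise_with_syntax(utterances, templates, syntax, start="start"):
    """Recognise each word using only the templates legal in the current state."""
    state = start
    result = []
    for unknown in utterances:
        legal = syntax[state]                    # the active vocabulary
        word = min(legal, key=lambda w: distance(templates[w], unknown))
        result.append(word)
        state = legal[word]                      # follow the transition
    return result


# Hypothetical syntax and one-parameter templates (made-up numbers).
syntax = {
    "start": {"what": "s1", "put": "s1"},
    "s1":    {"is": "s2"},
    "s2":    {"two": "s3", "four": "s3"},
    "s3":    {},
}
templates = {"what": [0], "put": [1], "is": [2], "two": [3], "four": [4]}

print(recognise_with_syntax([[0], [2], [3]], templates, syntax))
# The second utterance matches "two" exactly, but only "is" is legal there.
print(recognise_with_syntax([[0], [3]], templates, syntax))
```

In the second call the speaker's "two" is reported as "is", because "is" is the
only word the syntax allows at that point.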


MULTIPLE  SPEAKERS 


It  has  been  pointed  out  several  times  that  all  of  the  techniques  presented  so 
far  are  speaker  dependent,  due  primarily  to  the  use  of  speaker-specific 
reference  patterns.  In  general,  performance  for  a  person  using  somebody 
else’s  reference  patterns  is  pretty  poor,  the  exact  level  of  performance  being 
dependent  on  the  similarity  between  the  two  people’s  voices.  Of  course 
speaker  dependent  systems  may  be  used  by  any  number  of  users,  as  long  as  they 
each  train  the  machine  first.  If  they  do  this  each  time  they  use  it,  then 
they  will  get  better  performance  than  if  they  do  it  once  and  recall  those 
reference  patterns  on  later  occasions.  However,  if  the  vocabulary  is  large, 
the  training  may  be  too  tedious  to  do  more  than  once. 

Neither of these techniques is suitable for the situation where the users are
unknown and thus will not have trained the machine at all (such as a person
making an enquiry over a telephone). In this instance the most successful
technique has been based on selecting representative reference patterns from
a range of speakers sufficiently wide to cover the pronunciations of
the expected users [9]. It has been found, using the [...]
technique on data from fifty male and fifty female speakers, [...]
and 12 reference patterns per word gives good performance over
unknown speakers. Of course there is a limit to how far this procedure can
be applied. Eventually, with a large cross-section of accents, problems of
ambiguity arise because the pronunciations of different words begin to
overlap. Nevertheless, if the accent variations in an expected
population are relatively small, then the technique can be quite useful.


NOISE


In  most  environments  noise  is  always  present,  and  this  is  another  source  of 
variability  when  it  comes  to  recognising  speech.  The  effects  of  noise  on  a 
recogniser are threefold: first, the segmentor may make errors in determining
when speech is present; second, noisy speech is more likely to be
misrecognised; and third, the speaker may change his vocal characteristics
because of the noisy environment.

There  are  a  number  of  techniques  which  can  be  used  to  combat  the  effects  of 
noise,  but  first  it  is  worth  pointing  out  that,  whatever  the  situation,  higher 
performance  is  almost  always  obtained  if  training  is  done  in  the  environment 
in which a machine is going to be used. This is particularly true in a noisy
environment.

Obviously a noise cancelling microphone helps considerably to overcome
background noise, since a less noisy speech signal then reaches the
recogniser. It is also possible to use a separate piece of noise cancelling
equipment between the microphone and the recogniser. Alternatively, noise
compensation may be integrated directly into the recognition algorithm itself
[10]. In particular, the frame to frame comparison process in the DTW may be
modified to take into account an estimate of the effects that the noise has on
the individual parameters in the frames, hence recognition proceeds by
actively ignoring data which is known to be noisy.
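
One simple way of ignoring noisy data in the frame comparison is sketched
below in Python. It assumes filter bank frames and a per-channel estimate of
the noise level, and it raises any channel value that lies below the noise
estimate up to that estimate before taking the squared difference, so that
channels buried in noise contribute nothing. This masking rule is an
illustrative assumption and is not necessarily the method of [10].

```python
def masked_frame_distance(ref_frame, unknown_frame, noise_estimate):
    """Squared frame distance that ignores channels buried in noise."""
    total = 0
    for ref, unk, noise in zip(ref_frame, unknown_frame, noise_estimate):
        # A value below the noise level cannot be trusted, so it is replaced
        # by the noise level; two masked values then differ by zero.
        total += (max(ref, noise) - max(unk, noise)) ** 2
    return total


ref = [10, 2, 8]
unk = [10, 5, 1]
noise = [0, 6, 3]       # the second channel is swamped by noise
print(masked_frame_distance(ref, unk, noise))          # masked distance
print(masked_frame_distance(ref, unk, [0, 0, 0]))      # ordinary distance
```

Such a function could stand in for the sum of squares used to fill the DTW
distance matrix, leaving the rest of the algorithm unchanged.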

If  the  noise  is  impulsive,  rather  than  continuous,  then  it  is  sometimes 
possible  to  train  the  recogniser  on  these  sounds,  and  then  allow  it  to 
recognise  them  as  they  occur.  This  technique  has  been  found  to  be 
particularly successful in coping with breathing noises.


CONCLUSION 


This  lecture  has  provided  a  brief  overview  of  a  number  of  techniques  which  are 
central  to  the  operation  and  application  of  practical  automatic  speech 
recognition equipment. Many of the algorithms are relatively simple in
concept,  and  very  few  of  the  many  problems  facing  automatic  speech  recognisers 
have  been  satisfactorily  solved.  Nevertheless,  the  techniques  are  such  that 
machines  are  now  available  which  display  a  level  of  performance  which  is 
suitable  for  many  limited  applications  [11]. 


REFERENCES 


[1] K H Davis, R Biddulph and S Balashek. "Automatic recognition of
spoken digits". J. Acoust. Soc. Amer., vol.24, 1952, pp 637-642.

[2] L R Rabiner and M R Sambur. "An algorithm for determining the
endpoints of isolated utterances". Bell Syst. Tech. J., vol.54,
1975, pp 297-315.

[3] F Itakura. "Minimum prediction residual principle applied to speech
recognition". IEEE Trans. Acoust. Speech, Signal Processing, vol.23,
1975, pp 67-72.

[4] L R Rabiner and J G Wilpon. "A simplified, robust training procedure
for speaker trained isolated word recognition". J. Acoust. Soc.
Amer., vol.68, 1980, pp 1271-1276.

[5] H Sakoe. "Two-level DP matching - a dynamic programming based
pattern matching algorithm for connected word recognition". IEEE
Trans. Acoust. Speech, Signal Processing, vol.27, 1979, pp 588-595.

[6] C S Myers and L R Rabiner. "Connected digit recognition using a
level building DTW algorithm". IEEE Trans. Acoust. Speech, Signal
Processing, vol.29, 1981, pp 284-297.

[7] J S Bridle and M D Brown. "Connected word recognition using whole
word templates". Proc. Inst. Acoust., Autumn 1979.

[8] L R Rabiner, A Bergh and J G Wilpon. "An improved training procedure
for connected-digit recognition". Bell Syst. Tech. J., vol.61, 1982,
pp 981-1001.

[9] L R Rabiner, S E Levinson, A E Rosenberg and J G Wilpon. "Speaker-
independent recognition of isolated words using clustering
techniques". IEEE Trans. Acoust. Speech, Signal Processing, vol.27,
1979, pp 336-349.

[10] D H Klatt. "A digital filter bank for spectral matching". Proc.
IEEE Int. Conf. Acoust. Speech, Signal Processing, 1976, pp 573-576.

[11] L R Rabiner and S E Levinson. "Isolated and connected word
recognition - theory and selected applications". IEEE Trans.
Communications, vol.29, 1981, pp 621-659.




(C) Controller HMSO London 1983




Speaker  Differences  in  Speech  and  Speaker 
Recognition 

Melvyn  J.  Hunt 

National  Research  Council  of  Canada 
National  Aeronautical  Establishment 
U66,  Montreal  Road 
Ottawa,  Ontario 
K1A 0R6
Canada 


Summary 

This talk is concerned with the differences between speakers. The
range of ways in which speakers differ is surveyed, with distinctions
being drawn on the one hand between physiological and usage
differences and on the other hand between those differences stemming
from the larynx and those stemming from the vocal tract.
Methods of dealing with speaker differences in speaker-independent
and speaker-adaptive speech recognition systems are
discussed. This is followed by a discussion of the exploitation of
speaker differences in speaker recognition systems. The latter
discussion is divided into a consideration of speaker verification, in
which a speaker is trying to prove his identity, and speaker
identification, in which the identity of an unknown speaker has to
be discovered and the speaker cannot be expected to cooperate in
producing a predetermined phrase. The talk is concluded by a
summary of the present state of the art in dealing with speaker
differences together with some guesses about the prospects for
practical systems in the near future.


Introduction 

This  session  is  concerned  with  the  differences  between  the  speech  of  speakers  of  the 
same  language.  I  want  to  look  at  what  sort  of  differences  there  might  be,  and  how 
those  differences  might  be  useful  in  some  tasks  where  the  aim  is  to  determine  the 
identity of the speaker, and a nuisance in others where the aim is automatic
recognition of what is being said irrespective of who is saying it.

What  sort  of  differences  are  there? 

One  way  in  which  speakers  differ  is  in  the  choice  of  the  words  and  expressions  they 
use.  Humans  probably  make  use  of  such  information  in  recognizing  each  other,  but  I 
see  little  possibility  of  any  automatic  system  having  that  capability  in  the  near  future, 
so I propose to leave this kind of speaker difference out of the discussion.

To consider how differences in voices can arise, it is useful to go back to the
source/filter model of speech production that I talked about in the earlier session.
You may remember that the production of voiced speech can be quite accurately
modeled as an acoustic source, the larynx, feeding into a linear filter, the vocal tract.
Both the source and filter contain speaker-dependent characteristics, these
characteristics depending in both cases partly on physiology and partly on usage.

Excitation  Differences 

The larynx varies in size between individuals; in particular, a man's larynx is much
larger than a woman's and consequently average values of adult male fundamental
frequency (around 100 Hz) are much lower than corresponding values for adult females
(around 200 Hz).

Fundamental frequency is also to some extent under the control of the speaker.
Differences in speakers' use of fundamental frequency seem to me to be describable in
two ways: we can describe them in an absolute sense, for example by measuring the
variance of fundamental frequency about its mean value, or we can describe them
with reference to the associated sentence structure by noting such phenomena as a
tendency for fundamental frequency to rise or fall at the occurrence of a particular
syntactic feature in a sentence. I suspect that the latter kind of description is much
more important in distinguishing speakers, and the former much less important, than
we normally imagine. Evidence for this contention comes from some work in language
discrimination carried out by Maidment [1]. He had a group of subjects listen to the
fundamental frequency patterns of sentences taken from conversations, some of which
were in English and some in French. The signal was taken electrically directly from the
larynx, and the vocal tract had essentially no influence on it. French and English
intonation patterns normally sound quite different, yet in trying to identify which
language they were hearing from these intonation patterns, with their usual
accompaniment of vocal-tract information removed, most subjects performed rather poorly
(average 63% correct, against a chance level of 50%).

The details of the action of the vocal cords within each excitatory cycle also vary
between speakers, and are the major factor in what is known as voice quality [2].
Among others, the adjectives breathy, harsh and creaky are used to describe
laryngeally determined voice qualities. We know relatively little about what causes
different voice qualities, or even how to classify them, but it seems probable that, like
fundamental frequency, voice quality is partly physiologically determined and partly
under the control of the speaker. It is likely that in good acoustic conditions listeners
use voice quality information to identify speakers, but it tends to get lost when the
speech is distorted, as it is by being transmitted over a telephone link for example.

Vocal Tract Differences

As far as speech production is concerned, the most obvious and probably most
important physiological difference between vocal tracts is in their length. In particular,
women tend to have vocal tracts about 15% shorter than men, and adult female vocal
tract resonances are consequently about 15% higher in frequency than corresponding
male resonances. Whether length differences result in a simple linear scaling of
frequencies is still questioned, but linear scaling does seem to be a reasonable first
approximation.
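
To see why a length difference should give something close to a uniform scaling
of resonance frequencies, it may help to take the crudest possible model of the
vocal tract: a uniform tube closed at the larynx end and open at the lips, whose
resonances lie at odd multiples of c/4L. The sketch below works through the
arithmetic. The uniform-tube idealization, the 17 cm length and the speed of
sound are all assumptions made for the example and are not figures taken from
this talk.

```python
def tube_resonances(length_cm, count=3, sound_speed_cm_s=35000.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    The k-th resonance is (2k - 1) * c / (4 * L).
    """
    return [
        (2 * k - 1) * sound_speed_cm_s / (4.0 * length_cm)
        for k in range(1, count + 1)
    ]


# Assumed illustrative lengths: 17 cm, and a tube 15% shorter.
longer = tube_resonances(17.0)
shorter = tube_resonances(17.0 * 0.85)

# Every resonance is raised by the same factor, 1 / 0.85.
ratios = [s / l for s, l in zip(shorter, longer)]
print([round(f) for f in longer])
print([round(f) for f in shorter])
print([round(r, 3) for r in ratios])
```

In this idealization every resonance moves by the same factor, which is the
linear scaling referred to above; a real vocal tract is not a uniform tube, so
the scaling can only be expected to hold approximately.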

There  are  certainly  other  physiological  differences  in  vocal  tracts  -  differences  in 
absorption  in  tract  walls,  for  example,  which  would  affect  resonance  bandwidths  -  but 
for  the  most  part  their  effects  on  speech  have  not  been  extensively  studied. 

Differences  in  vocal  tract  usage  are  of  many  different  kinds.  They  range  from  an 
idiosyncratic  pronunciation  of  a  single  word  by  a  single  individual  to  a  tendency  for  a 
whole  dialect  group  to  use  a  particular  vocal  tract  setting  in  a  number  of  speech 
sounds.  As  an  example  of  the  second  extreme,  many  speakers  from  the  Birmingham 
area  of  England  tend  to  have  the  back  of  the  tongue  slightly  raised  throughout  much 
of  their  speech. 

Between these two extremes there can be differences in how a particular phoneme is
realised. Sometimes, the differences can affect a whole sequence of phonemes. For
example, in the speech of many speakers of "standard" European French there are
three nasalized vowel phonemes that occur in the words bon, banc, bain; in the speech
of many French Canadians bon is pronounced like standard French banc, banc like
bain, and bain is pronounced with a nasalized diphthong not occurring in standard
French. Sometimes, phonemes can show context-dependent differences: to take
another French Canadian example, the /i/ phoneme occurring in mite is pronounced
by many French Canadians rather like the English vowel in bit, but only when it is
followed by a consonant; otherwise it has the tenser, standard French form.

Finally,  a  particular  speaker  or  group  of  speakers  can  have  a  different  system  of 
phonemes  from  the  other  speakers  of  the  language:  the  northern  English  dialect  that  I 
grew  up  speaking,  for  example,  has  only  one  phoneme  for  the  vowels  that  occur  in  luck 
and look or putt and put, so the pairs of words sound identical, whereas standard
British English has two phonemes and the pairs of words are distinguishable by their
pronunciations. 

At  this  point,  I  would  like  to  say  that  I  am  now  going  to  look  at  how  these  various  kinds 
of  differences  are  systematically  handled  and  exploited  in  automatic  speech  and 
speaker  recognition  systems;  but  our  understanding  in  this  area  is  not  yet  that 
advanced.  Speaker  variations  are  dealt  with  instead  by  a  set  of  pragmatic  approaches, 
and  the  most  I  can  suggest  is  that  in  considering  these  approaches  one  should  keep 
the  various  sources  of  variation  in  mind. 

Speaker  differences  in  speech  recognition 

Differences between speakers pose a major problem for developers of speech
recognition devices intended for use by more than one person. Contributors at the first NATO
Advanced Study Institute on Spoken Language Generation and Understanding held at
Bonas, France in 1979 were asked to identify the major outstanding problem in speech
recognition. Without exception they named the need for speaker independence. In the
same year panel members at the IEEE Workshop on Speech Recognition held in
Pittsburgh were asked the same question and gave the same unanimous reply. Today, four
years later, there are still very few speech recognition devices available that work
reliably for anyone other than the person who provided the training material.

The importance of a capacity to accept input from more than one speaker depends, of
course, on the application. In a central system being accessed briefly by a large
number of military or civilian users, speaker independence is essential. On the other
hand, it is not particularly important for a device that is to be used for a long period
by one person, particularly if the vocabulary is small so that total retraining is not a
lengthy process. Pilots could, for example, carry a cassette recording with them that
would allow them to retrain a system to their own voice before a flight without having
to speak the training material each time. Even in this case, however, research on
speaker independence may well prove relevant, because it is likely that in learning
how to make a system tolerant to variations between speakers we will also learn how
to make it tolerant to changes occurring in the same speaker caused, for example,
by high accelerations or by high levels of psychological stress.

Before  I  go  on,  I  should  point  out  the  speciousness  of  a  claim  that  is  sometimes  made 
for  speaker-dependent  devices,  namely  that  they  provide  a  measure  of  security 
against  unauthorized  use.  A  speaker-dependent  device  simply  makes  more  recognition 
errors when used by a speaker who did not train it. A level of accuracy that was
unacceptably low for the legitimate user could be perfectly acceptable for a determined
unauthorized  user.  To  claim  speaker  dependence  as  an  advantage  seems  rather  like 
saying  that  small,  cramped  cars  have  the  advantage  of  being  unlikely  to  be  stolen  by 
tall  thieves. 

How can speaker independence be achieved?

Humans carry out speaker-independent speech recognition all the time. When a
stranger starts to speak we usually understand him immediately. If he speaks with an
unfamiliar accent and if what he says is not highly predictable from the situation, our
recognition is error-prone, but it is still enormously better than most artificial
recognizers can manage. It may be helpful, then, to ask ourselves how humans manage the
task. There seem to me to be three mechanisms that may be involved: a listener may
derive acoustic/phonetic features from the speech that are invariant across speakers;
he may accept a set of alternative production forms; or he may deduce some general
characteristics of the speech he is hearing and use them to adapt his recognition
process. There is some dispute about the relative importance of the first and last
possibilities [3,4], but it seems likely that all three mechanisms are involved to some
degree. The fact that we do a fairly good job of understanding a new speaker
immediately suggests that our analysis of the speech signal is good at extracting
speaker-independent cues. There is no doubt, however, that our comprehension of a new
speaker, particularly one speaking an unfamiliar dialect, does get better after we have
heard a few sentences from him and thus had a chance to adapt to his voice. Finally,
we must store alternative forms of at least some words in order to handle a case like
the word either, which has two distinct pronunciations, and the pronunciation that a
particular speaker will choose is not predictable from the rest of his speech. As we
shall see, automatic speech recognition systems have incorporated all three of these
mechanisms to varying extents.

Speaker-invariant acoustic representations

The search for invariants has been carried out in two ways. The first is to assume that
the acoustic representation generated by the human ear is one which minimizes
speaker differences, and a representation that faithfully copies the ear should
consequently provide a degree of speaker independence. Some workers [5,6] have reported
evidence tending to confirm this assumption, and at least one connected speech
recognition system has achieved speaker independence by carefully modeling neural
behavior in the ear together with details of the phonetic characteristics of the
language used (the only published account I can find of this work [7] deals with an
earlier, isolated word system).

Alternatively, features in the speech signal that are invariant across speakers can be
sought by statistical methods. The simplest form of this approach is to apply suitably
chosen weights to the features used in the comparison of the speech to be recognized
with the reference forms. Each weight would depend inversely on the measured
inter-speaker variability of the corresponding feature. Thus, if the ends of words were
found to vary more across speakers than the beginnings and middles they would be
given less weight in the comparison process. The same applies to regions of the
spectrum: my experience [8], for instance, has been that when comparing power spectra
across speakers for equivalent speech sounds the energy profile in the first 300 Hz is
much more variable than the rest of the spectrum (presumably because of differing
fundamental frequencies), and it therefore helps considerably to reduce the weight
given to this portion of the spectrum in the matching process. I believe that the
Verbex (formerly Dialog) speech recognition system [9], whose use by the State of Illinois
civil service constituted the first large scale application of a speaker independent
system, used variability weighting as a function both of frequency and position within the
word. A more sophisticated approach [10] takes linear combinations of features to
derive speaker-independent linear discriminant functions.
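
The simplest form of variability weighting can be sketched in a few lines.
Here each feature receives a weight inversely proportional to its variance
across speakers, measured on examples of the same speech sound, and the weights
are then used in a squared-difference comparison. The function names, the small
variance floor and the normalization of the weights to sum to one are choices
made for the example; none of the systems cited is claimed to work in exactly
this way.

```python
from statistics import pvariance


def variability_weights(examples, floor=1e-6):
    """Weights inversely proportional to inter-speaker variance.

    `examples` holds one feature vector per speaker for the same speech sound.
    The floor avoids division by zero for a feature that never varies.
    """
    n_features = len(examples[0])
    raw = []
    for j in range(n_features):
        variance = pvariance([example[j] for example in examples])
        raw.append(1.0 / (variance + floor))
    total = sum(raw)
    return [w / total for w in raw]


def weighted_distance(a, b, weights):
    """Weighted squared distance between two feature vectors."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


# Hypothetical data: feature 0 varies widely across three speakers,
# feature 1 hardly at all.
same_sound = [[10.0, 5.0], [20.0, 5.1], [30.0, 4.9]]
weights = variability_weights(same_sound)
print(weights)
```

With these figures almost all of the weight falls on the second feature, so a
large difference in the first feature between two speakers has little effect
on the comparison.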

Alternative reference forms

Turning to the use of discrete, alternative forms, we have to draw a distinction
between two kinds of speech recognition systems. In the first kind, which I will call
non-segmenting systems, the basic reference forms are whole-word templates
represented as a sequence of vectors describing the power spectrum and spaced
regularly apart in time, typically every ten milliseconds. In the second kind, words are
broken down into a sequence of phonetically classified segments of variable duration.
In these segmenting systems, it is the segments - corresponding very loosely to
phonemes in some cases - rather than words that form the set of basic reference
units. Although, for what seem to me to be good reasons, the trend in practical
recognizers is strongly towards word based systems, the segmenting approach is undeniably
more efficient at representing alternative forms of words when the difference is
localized in one part of the word, as it is in the first syllable of my either example. This is
because it is relatively easy for the system developer to represent variable portions of
words by constructing branching networks from the segmental reference units. Such a
network representation was used, for example, in the Harpy system [11] developed as
part of the ARPA Speech Understanding Project. It has been suggested [12] that
branching within words could be incorporated into non-segmenting systems, and at
least one group has described experiments in constructing such templates [13]. The
task is much less straightforward than in the segmenting case, however.
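
A branching network of this kind is easy to picture as a small graph. The
sketch below represents the two pronunciations of either as a network that
branches only in the first syllable and then enumerates the segment sequences
it accepts. The segment labels and the dictionary layout are invented for the
example; this is not the representation used in Harpy.

```python
def expand(network, node="start", path=()):
    """Yield every segment sequence leading from `node` to the 'end' node."""
    if node == "end":
        yield list(path)
        return
    for segment, next_node in network[node]:
        yield from expand(network, next_node, path + (segment,))


# Each node maps to a list of (segment label, next node) arcs.
# The labels "iy", "ay", "dh" and "er" are illustrative only.
EITHER = {
    "start": [("iy", "mid"), ("ay", "mid")],  # the two first-syllable vowels
    "mid": [("dh", "tail")],                  # shared consonant
    "tail": [("er", "end")],                  # shared final syllable
}

pronunciations = list(expand(EITHER))
arcs_in_network = sum(len(arcs) for arcs in EITHER.values())
segments_if_listed_separately = sum(len(p) for p in pronunciations)
print(pronunciations)
print(arcs_in_network, segments_if_listed_separately)
```

The network stores the shared part of the word once, which is the sense in
which the segmenting approach represents localized alternatives more
efficiently than a separate whole-word template for each variant.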

In practice, the only way that non-segmenting systems have allowed for alternative
forms is by having separate whole-word templates for each variant. The multiple
templates for each word are usually created by taking examples of the word from a
hundred or more speakers and averaging together groups of examples that are found to
be similar. The Verbex system mentioned earlier used, I believe, three such templates
per word. Recognition systems have been demonstrated at Bell Labs [14] that rely
exclusively on multiple templates to obtain successful speaker-independent
performance. Typically, six variants are used for each word when the speaker population
forms a homogeneous dialect group.
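
The idea of averaging together groups of similar examples can be sketched as
follows. For simplicity each example of a word is assumed to have been reduced
already to a fixed-length feature vector (real templates are sequences of
spectral vectors and would need time alignment first), and the grouping is done
by a plain k-means style iteration. That choice of procedure, and all the names
used, are assumptions made for the example rather than a description of the
clustering used in the systems mentioned above.

```python
def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_templates(examples, k, iterations=10):
    """Group the examples into k clusters and return one averaged template each."""
    # Deterministic start: take initial centres spread through the example list.
    centres = [list(examples[i * len(examples) // k]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for example in examples:
            nearest = min(range(k), key=lambda c: squared_distance(example, centres[c]))
            groups[nearest].append(example)
        # An empty group keeps its previous centre.
        centres = [mean_vector(g) if g else centres[i] for i, g in enumerate(groups)]
    return centres


# Hypothetical examples of one word from four speakers, falling into two groups.
examples = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
templates = cluster_templates(examples, k=2)
print(templates)
```

Each returned template stands for one group of similar speakers; at recognition
time an input word would be compared with every template of every word, which
is where the cost of using several templates per word arises.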

It has sometimes been claimed that multiple templates represent by themselves a
satisfactory solution to the problem of speaker differences in speech recognition.
While the success of the approach is very impressive, I do not believe that multiple
templates constitute the entire solution, for the following reasons. First, the use of n
templates for each word means n times more storage and n times more computation, so
practical considerations dictate that n should be as small as possible. Second, the classic
work of Peterson and Barney [15] showed that a sound (specified in terms of the first two
formant frequencies) that would represent one vowel phoneme when produced by one
speaker could represent a different vowel phoneme when produced by another
speaker. These repeatedly confirmed results suggest to me that in automatic systems
in which speaker differences are handled exclusively by multiple templates, the
distributions in acoustic space of templates representing different words will inevitably
start to overlap as we move to larger vocabularies. If the system makes no attempt to
learn something of the characteristics of the current speaker, it will have no way of
making reliable recognition decisions in the overlapping regions. Finally, while some
differences in the pronunciation of a particular word are definitely discrete, others
such as those resulting from differences in vocal tract length are continuous in
nature. To try to represent a continuous range of variation by a few discrete points
seems, to say the least, inelegant.

Speaker adaptation

The potential effectiveness of the third approach to obtaining speaker independence,
namely adaptation to the current speaker, depends very much on the intended
application. The process inevitably takes time, and if it is to be worthwhile, each new
speaker must continuously use a system for a period several times longer than the
time needed for useful adaptation. On the other hand, if the expected period of use is
quite long and the vocabulary quite small, the time overhead in having each new user
re-enter the complete vocabulary may not seem unreasonable, and it will almost
certainly lead to better performance than the more sophisticated adaptation schemes.
Adaptation material can be collected by having the new speaker utter a predetermined
phrase before he starts to use the system, or it can be gathered on-line in the
initial portion of the speaker's use of the system. On-line adaptation is much less
obtrusive, but its use depends on the system being able to provide a reasonable level
of recognition performance before any adaptation has taken place. Moreover, unless
careful verification is carried out to ensure that early inputs have been correctly
recognized, there is a danger that the system might try to extract adaptation
information from incorrectly recognized material, with disastrous results.

Turning to how adaptation can be achieved, it seems to me that segmenting systems
are at an advantage once again. Systems like Harpy with a hundred or so speech
sound templates can efficiently update their complete set of templates by having the
user utter a suitably chosen phrase containing all the sounds in the inventory [16].
Furui has shown [17] that by taking account of the correlations that exist in speaker
differences in different speech sounds it is possible to begin to update speech sound
templates before an example of that speech sound has been given.

In word-based systems one way of achieving on-line adaptation if the vocabulary is
small is to update the template for each word as it occurs in the input. The updating
can consist of a direct replacement of the old template by the newly received example
or of an averaging together of the old template and the new example. The Verbex
system mentioned earlier incorporated this kind of adaptation.
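
Both forms of updating can be written as one operation with a single weight,
as in the sketch below. It assumes that the new example has already been
time-aligned with the old template, so that the two contain the same number of
frames; the function name and the weight parameter are introduced for the
example and are not taken from any of the systems mentioned.

```python
def update_template(old, new, weight=0.5):
    """Blend a newly recognized example into a stored word template.

    `old` and `new` are lists of frames (each frame a list of numbers),
    assumed already time-aligned and of equal length.
    weight=1.0 replaces the old template; weight=0.5 averages the two.
    """
    return [
        [(1.0 - weight) * o + weight * n for o, n in zip(old_frame, new_frame)]
        for old_frame, new_frame in zip(old, new)
    ]


# Hypothetical two-frame, two-channel template and a new aligned example.
stored = [[1.0, 3.0], [2.0, 2.0]]
example = [[3.0, 5.0], [4.0, 0.0]]

averaged = update_template(stored, example)
replaced = update_template(stored, example, weight=1.0)
print(averaged, replaced)
```

A weight well below 0.5 would make the template drift only slowly towards the
new speaker, which limits the damage done if an incorrectly recognized word is
used for updating.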

A  second  way  in  which  word-based  systems  can  adapt  is  to  assume  that  the  differences 
between  corresponding  speech  sounds  for  two  speakers  are  substantially  the  same. 
That  is,  the  difference  between  one  speaker's  /i/  phoneme  and  another  speaker's  /i/ 
phoneme  is  assumed  to  be  similar  to  the  difference  in  the  productions  of  their  /e/ 
phonemes,  for  example.  Such  an  assumption  is  likely  to  be  valid  for  excitation 
differences  and  for  differences  resulting  from  physiological  characteristics  of  their 
vocal  tracts,  such  as  their  lengths,  but  it  will  not  in  general  be  true  for  differences  in 
usage  of  their  vocal  tracts.  In  so  far  as  the  assumption  is  valid,  speaker  differences 
can  be  determined  by  time-aligning  corresponding  words  from  the  two  speakers  so 
that  the  aligned  power-spectrum  vectors  correspond  to  equivalent  speech  sounds. 
Any consistent differences across pairs of aligned vectors can then be used to
construct a speaker-adapting spectral transformation. In some digit recognition
experiments I carried out [18] the use of transformations derived in this way reduced the
average recognition error rate by a factor of two after just three digits had been
input.
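
As an illustration of the principle only, the sketch below derives the simplest
conceivable transformation, a fixed offset in each spectral channel, from pairs
of vectors that are assumed to have been time-aligned already, and then applies
it to a stored template. The restriction to a per-channel offset and the
function names are assumptions made for the example; the transformations used
in [18] are not specified here and may well have been more elaborate.

```python
def mean_offset(aligned_pairs):
    """Average per-channel difference (new speaker minus reference speaker).

    `aligned_pairs` is a list of (reference_vector, new_vector) tuples that
    are assumed to correspond to equivalent speech sounds.
    """
    n = len(aligned_pairs)
    channels = len(aligned_pairs[0][0])
    return [
        sum(new[j] - ref[j] for ref, new in aligned_pairs) / n
        for j in range(channels)
    ]


def adapt_template(template, offset):
    """Shift every frame of a reference template towards the new speaker."""
    return [[v + d for v, d in zip(frame, offset)] for frame in template]


# Hypothetical aligned log-spectrum vectors (two channels) from two speakers.
pairs = [
    ([10.0, 20.0], [12.0, 19.0]),
    ([30.0, 40.0], [32.0, 39.0]),
]
offset = mean_offset(pairs)
adapted = adapt_template([[10.0, 20.0], [50.0, 60.0]], offset)
print(offset, adapted)
```

Because the offset is estimated from whatever words have been spoken so far but
applied to every template, even words the new speaker has not yet uttered are
adapted, which is what makes this kind of scheme useful after only a few
inputs.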

To conclude this section, it seems to me that future successful large-vocabulary
speaker-independent systems are likely to include an attempt at a speaker-invariant
acoustic representation together with multiple versions of at least some words and a
capacity for speaker adaptation. The three approaches are not competitors: on the
contrary, they are likely to be more effective when they operate in concert.

Speaker  Recognition 

Speaker recognition is the positive side of speaker differences. I mean the heading to
cover two quite distinct classes of problem: speaker verification, in which
characteristics of a speaker's voice are used to verify that he is who he claims to be; and speaker
identification, in which there is an attempt to determine whether some speech to be
identified could have been generated by one of the speakers known to the
investigators. I want to confine the discussion here to automatic methods, and so
leave out of account the use of human listeners, with or without the use of
spectrograms, sometimes misleadingly referred to as voiceprints.

Despite  a  fair  amount  of  experimental  effort  on  the  two  problems,  there  have  been 
remarkably  few  practical  implementations  of  speaker  verification,  and  -  as  far  as  I  can 
tell  -  no  practical  use  of  automatic  speaker  identification  up  to  now. 

Speaker Verification

Speaker  verification  has  potential  applications  in  the  control  of  physical  access  to 
secure  areas  and  in  the  control  of  remote  access  to  sensitive  information,  such  as  an 
ability  to  confine  access  to  personal  bank  account  information  to  the  account  holder. 
I  think  that  the  remote  access  applications  are  more  interesting  because  there  are 
few  fully  automatic  alternatives. 

Verification is set apart from speaker identification partly by the fact that the
comparison process is essentially one-to-one rather than many-to-one. A much more
important difference, though, is that the speaker is cooperative and can therefore be
induced to utter a particular phrase. This utterance can then be compared with a
version of the same phrase known to have been produced by the person that the current
speaker is claiming to be. In this way, equivalent speech sounds in identical contexts
can be compared.

Speaker  verification  systems  have  been  tested  with  some  success  over-telephone  links 
[19,20,21],  generally  with  some  attempt  at  reducing  sensitivity  to  linear  distortions. 
Although  the  ability  to  use  text-dependent  methods  makes  verification  generally 
easier  than  identification,  the  reliability  demanded  of  a  verification  system  may  be 
much  higher.  Identification  used  in  the  early  stages  of  a  police  investigation  to  help 
reduce  a  long  list  of  suspects  to  a  shorter  list  can  be  tolerant  of  occasional  errors, 
whereas  in  a  verification  system  controlling  access  to  a  sensitive  site  a  single  error 
could  be  very  damaging.  Moreover,  verification  systems  are  more  prone  to  attack  by 
deliberate  imposters.  If  the  same  phrase  is  always  used,  an  imposter  could  potentially 
become proficient at mimicking another speaker's production of the phrase.
Alternatively, if he could procure a recording of the other speaker uttering the test phrase,
he could fool the system simply by playing the recording. It was presumably for these
reasons that in the speaker verification system controlling access to a Texas
Instruments computer room [22] the system permuted the words of the original reference
utterance and demanded an unpredictable phrase each time an individual presented
himself.
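
The idea of demanding an unpredictable phrase can be sketched as follows. This is only an illustration: the word list and the name `make_prompt` are hypothetical, and nothing here is taken from the system described in [22].

```python
import random

# Hypothetical enrolment vocabulary; the words used by any real system are not given here.
ENROLLED_WORDS = ["north", "seven", "bravo", "green", "delta", "river", "lamp", "coast"]

def make_prompt(words, length, rng):
    """Return `length` distinct enrolled words in a random order."""
    return rng.sample(words, length)

rng = random.Random()            # unseeded, so the prompt differs between attempts
prompt = make_prompt(ENROLLED_WORDS, 4, rng)
print(" ".join(prompt))
```

Because every prompted word was spoken at enrolment, the comparison can still be made between equivalent speech sounds, while a recording of any one earlier prompt is unlikely to match the next one.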

Speaker Identification

Interest in automatic speaker identification comes from the police, security and
intelligence organizations, and also from accident investigators trying to determine, for
example,  who  said  what  just  before  a  plane  crash. 

Automatic  speaker  identification  is  faced  with  two  problems  that  combine  to  make  the 
task  particularly  difficult.  The  first  is  the  lack  of  control  over  what  the  speaker  says. 
The  second  is  that  in  almost  every  practical  application  the  signal  from  which  the 
speaker  is  to  be  recognized  -  telephone  call,  intercepted  military  radio  transmission, 
or aircraft cockpit recording - must be expected to have suffered a significant degree
of  distortion. 

Let  us  look  first  at  the  implications  of  lack  of  control  over  the  text.  As  we  have  seen, 
the  great  strength  of  speaker  verification  is  that  by  controlling  the  text  it  is  able  to 
compare  equivalent  speech  sounds  in  equivalent  contexts.  Automatic  speaker 
identification  clearly  cannot  do  this,  and  I  see  little  hope  in  the  foreseeable  future  of 
an  automatic  system  being  able  to  spot  instances  of  a  particular  speech  sound  in  a 
speaker-independent  manner  and  on  transmission-degraded  speech  with  a  useful 
degree  of  reliability.  Even  if  a  system  could  be  made  to  spot  target  sounds  most  of  the 
time, the occasional misses and false alarms that would inevitably occur would
seriously bias the data being gathered.

Given  these  limitations,  almost  all  attempts  at  automatic  speaker  identification  have 
worked by seeking to produce a statistical description of the speech - typically in
terms of the properties of power spectra computed every centisecond - without any
regard to what is being said. Thus, much of what constitutes the difference between
two speakers will be blurred over - if a speaker pronounces 'luck' like 'look' the system
will  hardly  notice.  It  seems  rather  like  taking  a  handwritten  text  and  cutting  each  line 
into  a  set  of  thin  vertical  slices,  each  slice  a  few  times  thinner  than  the  average  letter, 
and  then  trying  to  determine  the  identity  of  the  writer  from  the  statistical  properties 
of  an  assorted  heap  of  the  slices.  The  surprising  thing  is  that  on  undistorted  speech 
such  a  method  works  quite  well. 

The traditional way of producing the statistical description [23] starts by computing a
set of statistical parameters - means, variances, etc. - of short-term signal properties
such as filter bank channel energies or linear predictor coefficients. The variation of
each parameter between speakers is then compared with its variation between
speech samples taken from the same speaker. This comparison provides a primary measure of
the usefulness of a parameter in speaker identification. However, such parameters
rarely turn out to be statistically independent of each other, so their correlations also
have to be taken into account. Given this information, and making certain assumptions
about the statistical distributions of the parameters, a linear transformation of
the parameters derived from each sample can be computed. Provided the assumptions
hold, the transformation provides optimum discrimination between the sample
sets belonging to the different speakers. The distances in the transformed parameter
space are known as Mahalanobis distances, and the process forms part of what is
known as linear discriminant analysis.
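
A minimal sketch of the two quantities involved may help. It assumes one common way of comparing between-speaker and within-speaker variation (a variance ratio) and writes out the Mahalanobis distance for two parameters only. The function names and the data are hypothetical, not taken from [23].

```python
from statistics import mean, pvariance

def f_ratio(samples_by_speaker):
    """Variance of the speaker means divided by the average
    within-speaker variance, for a single parameter."""
    speaker_means = [mean(v) for v in samples_by_speaker.values()]
    between = pvariance(speaker_means)
    within = mean(pvariance(v) for v in samples_by_speaker.values())
    return between / within

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points x and y, given the
    (assumed shared) covariance matrix cov = [[a, b], [b, d]]."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - y[0], x[1] - y[1]
    # (dx, dy) * inverse(cov) * (dx, dy)^T written out for the 2x2 case
    q = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return q ** 0.5

# Hypothetical values of one parameter for two speakers
print(f_ratio({"A": [1.0, 3.0], "B": [5.0, 7.0]}))
print(mahalanobis_2d((1.0, 0.0), (0.0, 0.0), [[4.0, 0.0], [0.0, 1.0]]))
```

A parameter with a large ratio separates speakers well relative to its variation within a speaker. The off-diagonal term `b` is where the correlations between parameters enter the distance.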

Recently, a couple of experiments have been described that step outside the
traditional framework. In the first [24], the distribution of short-term signal features (in
this case, linear prediction log area ratios) in the speech to be identified is compared
non-parametrically with the distributions generated by known speakers. In the second
[25], the speech of each of the known speakers is modeled by a Markov chain, and the
probability that a known speaker could have generated the unknown speech is
estimated from the degree to which the speaker's Markov model fits the speech data.
It  will  be  interesting  to  see  if  either  of  these  approaches  leads  to  practically  useful 
systems. 
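
The Markov chain approach can be illustrated with a toy sketch. It assumes that the speech has already been reduced to a sequence of discrete symbols (for example, quantised spectral frames) and uses a first-order chain with add-one smoothing. These choices, and all names below, are assumptions for illustration and are not details of [25].

```python
import math
from collections import defaultdict

def train_chain(symbols, alphabet):
    """Estimate first-order transition probabilities with add-one smoothing."""
    counts = {a: defaultdict(int) for a in alphabet}
    for prev, cur in zip(symbols, symbols[1:]):
        counts[prev][cur] += 1
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values()) + len(alphabet)
        probs[a] = {b: (counts[a][b] + 1) / total for b in alphabet}
    return probs

def log_likelihood(symbols, probs):
    """Log probability of the transitions in `symbols` under one speaker's chain."""
    return sum(math.log(probs[p][c]) for p, c in zip(symbols, symbols[1:]))

def identify(unknown, models):
    """Name of the known speaker whose chain fits the unknown sequence best."""
    return max(models, key=lambda name: log_likelihood(unknown, models[name]))

# Hypothetical symbol sequences for two known speakers
models = {
    "X": train_chain("aaaaabaaaa", "ab"),
    "Y": train_chain("abababab", "ab"),
}
print(identify("ababab", models))
```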

I would like to look now at the second serious problem encountered in automatic
speaker identification, that of transmission degradations. Researchers have tried to
find features in the speech signal that have some resistance to these degradations.
Perhaps the most resistant of all such features is fundamental frequency, the
tendency of the waveform in voiced speech to repeat itself periodically. The repetition
rate is clearly unaffected by linear or non-linear distortions, and it should remain
observable in the presence of moderate amounts of steady noise. As one would expect,
then, appropriately chosen algorithms can derive reliable statistics of fundamental
frequency even from heavily distorted speech. Since automatic systems cannot
determine the syntactic structure of the speech samples they are given, the statistics they
derive are necessarily of the absolute kind as described in the first section. I argued
there that such statistics are not likely to be rich in speaker-characterizing
information, and it is indeed found to be the case that they are not very effective in speaker
identification, particularly with the short samples of speech that one must expect to
work with in practical applications. Moreover, fundamental frequency is probably the
most mood-sensitive feature of the speech signal, and in many situations where
speaker identification could be useful the speaker would be unlikely to be in a normal,
calm state.
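
The robustness of the repetition rate to non-linear distortion can be seen in a small sketch. It uses a simple autocorrelation peak picker on a synthetic periodic signal, with hard clipping standing in for the distortion. The sampling rate, the test signal and the function name are assumptions for illustration, and practical pitch algorithms are considerably more elaborate.

```python
import math

def estimate_period(signal, min_lag, max_lag):
    """Return the lag (in samples) with the highest autocorrelation."""
    window = len(signal) - max_lag       # same number of products for every lag
    def autocorr(lag):
        return sum(signal[i] * signal[i + lag] for i in range(window))
    return max(range(min_lag, max_lag + 1), key=autocorr)

RATE = 8000      # assumed sampling rate in Hz
F0 = 125.0       # assumed fundamental frequency, i.e. a 64-sample period
voiced = [math.sin(2 * math.pi * F0 * t / RATE)
          + 0.5 * math.sin(2 * math.pi * 2 * F0 * t / RATE)
          for t in range(740)]
# Hard clipping stands in for a severe non-linear distortion.
clipped = [max(-0.3, min(0.3, s)) for s in voiced]

for name, sig in (("clean", voiced), ("clipped", clipped)):
    print(name, RATE / estimate_period(sig, 20, 100), "Hz")
```

The clipped waveform has a very different shape and spectrum from the clean one, but it still repeats at the same interval, so the same period is found.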

Measures  of  the  behavior  of  the  power  spectrum  ought  to  be  less  mood  sensitive.  The 
power  spectrum  is  also  richer  in  information  than  fundamental  frequency  is.  The 
drawback,  however,  is  that  most  properties  of  the  power  spectrum  are  very  sensitive 
to transmission distortions: there is no point in trying to characterize a speaker by his
long-term spectral average, for example, if that spectrum is going to be drastically
and unpredictably modified by the transmission process.

Spectrum-shaping linear distortion manifests itself as a set of frequency-dependent
additive constants in the log power spectrum. For this reason, measures of the
variability of the short-term log power spectrum about its mean have been proposed as
robust speaker-characterizing features in the presence of linear distortion [26]. These
measures do not, however, have any special resistance to noise or to non-linear
distortions, and I know of no speaker identification experiments in which they have been
successfully used on speech obtained from real - as opposed to simulated - telephone
links including carbon microphones (which cause non-linear distortions). My own
experience using real telephone speech [27] has been that such measures are somewhat
better than measures of the long-term spectral average, though somewhat worse
than  statistics  of  energy  peaks  in  the  spectrum.  Even  the  results  with  energy  peaks, 
though,  did  not  approach  a  practically  useful  level  of  performance. 
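
The reason that variability about the mean is immune to linear distortion can be checked with a few lines of arithmetic. The sketch below uses made-up log power spectra (in dB) and a made-up channel response. It shows only the principle and does not reproduce the measures of [26].

```python
def deviations_about_mean(frames):
    """Subtract each channel's long-term mean from every frame."""
    n_channels = len(frames[0])
    means = [sum(f[c] for f in frames) / len(frames) for c in range(n_channels)]
    return [[f[c] - means[c] for c in range(n_channels)] for f in frames]

# Hypothetical short-term log power spectra: 3 frames x 4 frequency channels (dB)
clean = [[60.0, 55.0, 40.0, 30.0],
         [62.0, 50.0, 45.0, 28.0],
         [58.0, 57.0, 41.0, 35.0]]
# A linear channel adds a fixed, frequency-dependent offset in the log domain.
channel_db = [-3.0, 0.0, -6.0, -20.0]
received = [[v + g for v, g in zip(frame, channel_db)] for frame in clean]

print(deviations_about_mean(clean)[1])
print(deviations_about_mean(received)[1])
```

The two printed rows agree because the channel offset appears equally in every frame and in the mean, and so cancels. Noise and non-linear distortion do not act as a fixed additive term in the log spectrum, which is why the same trick gives no protection against them.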

It  seems  as  though  we  can  have  successful  text-independent  speaker  identification  on 
undistorted  material,  and  speaker  verification  results  show  us  that  we  could  probably 
have successful text-dependent speaker identification on transmission-distorted
material; but effective text-independent speaker identification on
transmission-distorted material seems so far to be beyond our grasp. I note that in West Germany
dynamic microphones are now being used in public telephones, and I wonder whether
a trend towards the elimination of carbon microphones and the increasing use of
digital transmission might not mean that the troublesome distortions will be substantially
eliminated  before  workers  in  speaker  identification  learn  how  to  cope  with  them. 


A  summary  of  the  state  of  the  art 

I  would  like  to  conclude  by  summarizing  the  state  of  the  art  in  dealing  with  speaker 
differences. 

Several  speaker-independent  and/or  speaker-adaptive  speech  recognition  systems 
have  been  successfully  demonstrated,  and  a  few  commercial  systems  have  been  sold. 
In the next few years we should see commercial recognition systems with
multi-speaker capability becoming increasingly common. There will, however, always be a
proportion of the population who cannot use a particular system, either because of
personal peculiarities in their voices or because they speak a form of the language too
far removed from the forms on which the system was trained. Equally, I believe that it
will remain true that the most reliable performance will be obtained by training a
system on the voice of the person who is going to use it.

Speaker  verification  systems  have  been  successfully  demonstrated,  but  to  my 
knowledge  none  have  so  far  been  sold.  Their  commercial  appearance  may  be  linked 
with  the  large-scale  introduction  of  speaker-independent  recognition  systems  allowing 
fully  automatic  remote  access  to  information  banks  of  one  kind  or  another. 

In my opinion, a useful level of automatic speaker identification has yet to be
convincingly demonstrated on fully realistic, transmission-degraded material. We have to take
into account the fact that, since the effectiveness of a speaker identification system
would possibly be enhanced if its existence were unknown to the target group,
publication of a major breakthrough in this area might be suppressed. Nevertheless, it is my
guess that no major breakthrough has so far been made. It may be that automatic
speaker identification will eventually be rendered feasible not by progress in the
identification field itself but rather by the steady improvement in the quality of
speech telecommunications systems.

This lecture provides an overview of the key concepts relevant to the
evaluation  of  speech  transmission,  speech  synthesis  and  speech  recognition 
systems.  For  speech  transmission  systems  the  concept  of  intelligibility 
testing  is  introduced,  and  techniques  for  both  subjective  and  objective 
measurements  are  described  briefly.  It  is  then  pointed  out  that  very  few 
standard  tests  or  procedures  exist  for  assessing  either  speech  synthesis  or 
speech recognition systems. So, after a quick look at the problems posed by
speech  synthesisers,  the  remainder  of  the  lecture  concentrates  on  automatic 
speech  recognition.  The  key  issues  in  speech  recogniser  testing  are 
discussed,  and  it  is  pointed  out  that  the  need  to  evaluate  such  systems  has 
raised  some  very  difficult  questions,  to  which,  as  yet,  there  are  few 
satisfactory  answers;  this  area  is  currently  a  major  research  topic  in  the 
speech  recognition  community.  It  is  shown  how  many  interrelated  factors 
affect  the  performance  of  a  speech  recognition  machine,  and  that  even 
interpreting and comparing experimental results can present some
difficulties.  Some  tentative  procedures  are  outlined  and  a  scheme  for 
estimating  the  relative  difficulties  of  different  vocabularies  is  described  in 
detail. Finally, it is emphasised that evaluation techniques are crucial to
the  satisfactory  deployment  of  automatic  speech  recognition  equipment  in  real 
applications. 

1.0 SUMMARY


Some  terms  for  description  of  speech  recognition  systems  are  defined.  A  selection 
of  real-time,  commercially  available  speech  recognition  equipment  is  described, 
concentrating  on  the  high-performance  end  of  the  market.  Likely  developments  are 
indicated.

The  lectures  in  the  rest  of  this  series  have  concentrated  on  a  single  approach  to 
automatic  speech  recognition  -  that  using  whole-word  templates.  We  explain  current 
attempts  to  extend  the  capabilities  of  this  approach,  and  also  look  at  alternative 
approaches  which  are  the  subject  of  research  in  laboratories  around  the  world. 


2.0  INTRODUCTION 


This  chapter  is  intended  to  give  the  reader  an  idea  of  the  sorts  of  differences 
that  exist  between  current  speech  recognition  equipments  and  research  approaches. 

Anyone  considering  using  speech  recognition  in  a  real  application  such  as  in  an 
aircraft  cockpit  should  pay  attention  to  at  least  the  following  three  aspects.  Firstly, 
assuming  that  the  application  needs  a  high-performance  speech  recogniser,  ignore  the 
low-cost  end  of  the  market;  use  one  of  the  expensive,  bulky,  full-facility  equipments 
in  simulations  of  the  real  task,  to  find  out  what  is  required.  Secondly,  worry  about 
special  conditions  in  the  task  that  could  cause  problems.  Thirdly,  consider  how  an  ASR 
equipment  can  be  acquired  that  will  fit  into  the  space  available  and  interface  with  the 
other  equipment. 

This  survey  is  a  snapshot  at  a  particular  point  in  time.  I  shall  try  to  indicate 
the  way  that  things  are  moving,  but  you  can  assume  that  advances  in  micro-electronics 
will  make  available,  in  small,  affordable  packages,  anything  that  can  be  done  in  real 
time  now. 


3.0  TERMS  FOR  DESCRIPTION  OF  SPEECH  RECOGNITION  MACHINES 

The  list  below  concentrates  on  aspects  which  concern  the  designer  of  systems  which  are 
to  include  a  speech  recognition  equipment.  All  the  systems  below  are  Speaker  Dependent: 
this  means  that  they  are  meant  to  be  set  up  separately  for  each  speaker,  by  him  saying 
all  the  words  one  or  more  times  each. 


3.1  Vocabulary  size 

Much  confusion  has  been  caused  by  use  of  this  term  to  refer  to  very  different  aspects  of 
a recogniser's capabilities. First we must distinguish between words and templates.
Usually  each  template  corresponds  to  one  word  (or  a  short  phrase  which  can  be  used  as  if 
it  were  a  single  word)  but  one  word  may  be  represented  by  several  templates. 

The  total  number  of  templates  that  a  system  can  hold  depends  only  on  the  amount  of 
memory  in  it,  and  its  addressing  range.  The  number  of  templates  that  the  machine  can 
compare with the input at any one time depends on its processing power. But the number
of words that it can reliably distinguish between depends on its discrimination power,
the similarity of the set of words, and the way they are said. The discrimination power
is most interesting, but there are no established ways of defining it (although see
Moore [1]).

In  practice  the  number  of  words  that  a  speaker-trained  system  can  use  is  often 
limited  by  the  need  to  acquire  the  template  data.  See  'training  procedures'  below. 


3.2 Speaking style

The  first  commercial  speech  recognition  equipments  were  isolated  word  recognisers.  They 
required  the  user  to  pause  between  words  for  long  enough  to  mark  the  end  of  each  word. 
Because  many  words  contain  short  gaps  within  them,  the  gaps  between  words  had  to  be 
quite long, typically 300ms. The best systems impose a negligible extra delay for
processing.

One attempt to improve things was called 'Quicktalk'. The user still had to leave
a  gap  between  words,  but  it  could  be  smaller  than  the  largest  gap  in  a  word. 

Isolated-phrase  connected  word  recognition  systems  use  a  pause  (say  300ms)  to 
signal end-of-phrase, and allow the words to be spoken without pauses. These systems
usually  have  a  maximum  length  of  phrase,  may  have  a  maximum  number  of  words  in  a  phrase, 
and  produce  their  answers  after  the  final  pause. 
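
The role of the pause can be made concrete with a sketch of endpoint detection on a sequence of frame energies. It assumes 10ms frames, so that 30 consecutive quiet frames correspond to a 300ms pause, and a fixed energy threshold. Both are illustrative assumptions, and real equipments use more careful endpointing.

```python
def split_on_pauses(energies, threshold, min_pause_frames):
    """Return (start, end) frame index pairs of speech segments, end exclusive.
    A segment is closed only after `min_pause_frames` consecutive quiet frames,
    so shorter gaps inside a word or phrase do not split it."""
    segments, start, quiet = [], None, 0
    for i, e in enumerate(energies):
        if e >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_pause_frames:
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:                      # input ended before a full pause
        segments.append((start, len(energies) - quiet))
    return segments

# Hypothetical energies: speech, a 100ms internal gap, speech, a 300ms pause, speech
energies = [0]*5 + [1]*20 + [0]*10 + [1]*15 + [0]*30 + [1]*10 + [0]*5
print(split_on_pauses(energies, threshold=0.5, min_pause_frames=30))
```

The first segment cannot be passed on until the full pause has elapsed. This is the hard limit on input speed that pause-based schemes impose.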

Continuous  connected  word  recognisers  dispense  with  the  end-of-phrase  detection, 
deal  with  pauses  with  the  same  mechanism  as  is  used  for  word  recognition,  and  can 
produce  answers  while  the  user  is  talking. 

Isolated  word  mode  is  slow  and  unnatural.  Any  attempt  to  improve  input  speed  is 
bound  to  run  into  the  hard  limit  set  by  the  end-of-word  decision.  It  seems  that  the 
Quicktalk method can increase input rate, but merely places the hard limit somewhere
else.  I  have  no  data  on  the  relative  usefulness  of  the  two  connected  word  types.  It  is 
still  necessary  to  learn  a  technique  of  speaking  clearly,  consistently,  and  not  too 
fast,  but  connected  word  recognisers  seem  to  fail  more  gracefully  when  pushed  to  their 
limits.

In  published  work  on  recognition  performance  [2,3,4]  the  best  isolated  word 
recognition  performances  have  been  by  connected  word  recognisers. 


3.3  Control  of  the  recognition  process 

To  get  the  best  performance  from  a  speech  recogniser  it  is  necessary  to  limit  the  number 
of  different  words  considered  at  each  point.  In  isolated  word  recognition  the  set  of 
words  to  be  considered  can  be  controlled  quite  easily.  Some  systems  allow  the  host 
system  to  specify  the  set  of  templates  acceptable  each  time.  Others  have  internal 
control,  based  on  previous  recognition  decisions. 

In  connected  word  recognition  it  is  not  possible  to  decompose  the  recognition 
process  into  a  sequence  of  decisions  -  the  identity  of  each  template  in  the  best 
sequence  can  depend  on  the  identity,  and  position  in  the  input,  of  all  the  others.  Some 
connected  word  recognisers  can  use  a  specification  of  the  order  in  which  the  templates 
must appear (i.e. a grammar, or syntax), some can use just the identity of the templates
that  are  acceptable  in  a  string,  and  for  some  it  is  only  possible  to  specify  the  number 
of  words  expected  in  a  string.  There  is  no  uniformity  yet  in  capabilities  or 
terminology  in  this  area,  but  there  is  no  doubt  that  used  wisely  these  techniques  can 
greatly  enhance  performance  in  difficult  conditions. 
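
As an illustration of the simplest kind of control, the sketch below uses a small finite-state grammar to give the set of words whose templates need be considered at each point. The grammar, the word names and the function names are all hypothetical. In a connected word recogniser the same constraints would be applied inside the search rather than between separate decisions.

```python
# Hypothetical command grammar: state -> {word: next state}
GRAMMAR = {
    "START":   {"select": "DEVICE", "set": "PARAM"},
    "DEVICE":  {"radio": "CHANNEL", "radar": "END"},
    "CHANNEL": {"one": "END", "two": "END"},
    "PARAM":   {"heading": "DIGITS", "altitude": "DIGITS"},
    "DIGITS":  {"zero": "DIGITS", "one": "DIGITS", "two": "DIGITS", "enter": "END"},
}

def active_words(state):
    """Words whose templates need to be compared with the input in this state."""
    return sorted(GRAMMAR.get(state, {}))

def follow(words, state="START"):
    """Return the state reached, or None if the word string violates the grammar."""
    for w in words:
        state = GRAMMAR.get(state, {}).get(w)
        if state is None:
            return None
    return state

print(active_words("START"))
print(follow(["select", "radio", "two"]))
```

With this grammar only two of the eleven words are candidates at the start of a command, which reduces both the computation and the opportunities for confusion.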


3.4  Information  returned  from  the  recogniser 

The  most  basic  information  that  a  recogniser  can  return  is  the  serial  number  of  each 
template  recognised.  Many  recognisers  can  also  be  set  up  to  produce  an  arbitrary  string 
of  characters  when  a  template  is  recognised,  and  this  can  be  useful  when  driving  an 
application  program  which  was  designed  for  keyboard  input.  (A  recogniser  with  such 
capabilities is often described as a voice input terminal).

A  more  specialised  application  program  might  be  able  to  make  good  use  of 
indications  of  reliability  of  recognition  (scores)  and  alternative  interpretations  (with 
scores).  The  latter  are  more  difficult  to  provide  in  a  connected  word  recogniser  than 
in  an  isolated  word  recogniser. 


3.5  Training  procedures 

The  process  of  acquiring  template  patterns  generally  goes  by  the  rather  confusing  name 
of  'training'.  Training  methods  are  crucial  to  the  success  of  template  matching 
recognisers,  because  they  have  no  other  source  of  information  about  the  set  of  words  to 
be  distinguished.  (Training  the  user  is  also  crucial  to  success,  but  is  outside  the 
scope  of  this  lecture). 

Some  systems  combine  several  training  utterances  of  a  word,  to  produce  one, 
averaged,  template.  Other  systems  keep  training  tokens  as  separate  templates,  and 
recommend  that  two  or  three  examples  are  used  for  the  more  difficult  words  (the  digits 
usually).  Systems  that  can  make  do  with  single  example  utterances  are  at  an  advantage 
in  applications  with  many  words  that  are  quite  distinct.  Systems  that  can  exploit  the 
availability of many examples of each word are at an advantage in applications with a
relatively small number of difficult words that must be recognised reliably. No current
machine  combines  the  advantages  of  both  methods. 

Some  systems  use  a  'robust'  training  procedure  that  refuses  to  accept  utterances 
which  are  very  different  from  previous  ones  [5].  Connected  word  recognisers  often  use 
isolated  words  to  form  templates,  but  it  seems  that  the  use  of  examples  from 
connected-word contexts can give better results, particularly if information about the
variability  of  different  parts  of  each  word  is  captured  and  used. 


3.6  Rejection  of  spurious  inputs 

In  its  most  basic  mode  a  recogniser  will  compare  any  sound  with  its  current  set  of 
templates,  and  choose  the  most  similar  template.  All  recognisers  have  some  facilities 
for rejecting sounds that are very different from all the templates, and sometimes also
for rejecting inputs for which no one template scores significantly higher than all the rest. There is usually an
adjustable  reject  threshold,  but  the  important  thing  is  how  well  the  recogniser  will 
reject  spurious  noises  and  'illegal'  words  while  accepting  valid  words.  There  are  no 
accepted  tests  for  this  aspect  of  performance. 
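
A minimal sketch of such a decision rule follows. It assumes that the recogniser produces a distance to each template (lower meaning more similar) and applies two tests, one on the best distance and one on the margin over the runner-up. The thresholds and names are illustrative and do not describe any particular machine.

```python
def decide(distances, reject_threshold, min_margin):
    """distances: {template name: distance to the input}, lower is better.
    Return the best template name, or None if the input is rejected."""
    ranked = sorted(distances.items(), key=lambda kv: kv[1])
    best_name, best = ranked[0]
    if best > reject_threshold:
        return None                      # unlike every template
    if len(ranked) > 1 and ranked[1][1] - best < min_margin:
        return None                      # no clear winner
    return best_name

print(decide({"one": 0.2, "two": 0.9}, reject_threshold=0.5, min_margin=0.3))
print(decide({"one": 0.7, "two": 0.9}, reject_threshold=0.5, min_margin=0.3))
```

Raising `reject_threshold` accepts more valid words but also more spurious noises, so the setting is a trade-off that has to be judged on the task in hand.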


3.7 Size, weight and cost

These  factors  depend  on  whether  the  recogniser  is  a  chip  set,  a  board,  a  terminal  or  a 
complete development system. The amount of support from the manufacturer is also very
variable.


4.0  DESCRIPTIONS  OF  A  SELECTION  OF  ASR  EQUIPMENTS 

Lea  has  produced  a  book  [6]  and  a  recent  article  [7]  on  selecting  recognisers.  The  list 
below includes all connected word recognisers known to me, plus two high-performance
isolated  word  recognisers  which  are  of  special  interest. 


4.1 NEC DP200 (Nippon Electric Co. Ltd., Japan)

A  successor  [8]  of  the  first  connected  word  recognition  machine,  the  DP100  [9].  A 
self-contained  isolated  phrase  connected  word  recogniser,  with  built-in  tape  storage. 
Vocabulary size 50 to 150 words in connected mode. Maximum duration of phrase 4s. Up
to  5  connected  words  per  phrase.  Response  300ms  from  end  of  phrase.  The  set  of  words 
for  each  phrase  can  be  controlled.  One  or  two  one-utterance  templates  per  word.  Can 
use  connected  word  training. 


4.2  MSDS  SR128  (Marconi  Space  and  Defence  Systems,  U.K.) 

A  self-contained  isolated  phrase  connected  word  recogniser,  with  built-in  tape  storage. 
Template  memory  size  128  seconds.  Maximum  duration  of  phrase  10s.  No  limit  to  number 
of  words  in  a  phrase.  Response  300ms  from  end  of  phrase.  The  set  of  words  for  each 
phrase  can  be  controlled.  One  or  two  one-utterance  templates  per  word.  In  use  for 
flight  trials  in  civil  transport  aircraft.  Planned  to  fly  in  jet  fighter  in  1983. 


4.3 Logica LOGOS (Logica Ltd., U.K.)

An equipment [10] designed for experiments on applications and human factors aspects.
See lecture 'Inside a speech recognition machine' in this volume. Vocabulary store 100
to  2000  templates.  Computation  power  for  25  to  200  templates.  Continuous  connected 
word  recognition  style.  Response  delay  depends  on  ambiguity  of  input:  typically  one  or 
two  words  delay.  Recognition  can  be  guided  by  word  order  syntax  within  phrases,  with 
optional  automatic  switching  to  alternative  syntaxes.  Normally  one  or  two  one-utterance 
templates  per  word.  Can  use  connected  word  training. 


4.4  Verbex  3000  (Verbex  Corp.,  USA) 

A  successor  to  a  connected  word  recognition  system  which  was  implemented  on  the  Verbex 
1800  system.  Using  the  1800  system  Verbex  demonstrated  very  impressive  connected  digit 
recognition  performance  on  recordings  made  for  the  US  Postal  Service  [3].  The  Verbex 
3000  is  aimed  primarily  at  the  industrial  materials  handling  market,  and  very  good  noise 
tolerance  is  claimed.  It  uses  a  statistical  word  model,  which  generalises  the  idea  of  a 
template.  Word  models  are  built  automatically  using  many  examples  of  each  word. 
Connected-word  training  material  is  normally  used  if  the  application  calls  for  connected 
word  input.  Syntax  control  within  phrases.  Continuous-type  algorithm,  but  output  is 
produced 300ms after end of phrase.


4.5  Votan  1000,5000  (Votan,  USA) 

A  fairly  new  isolated  word  recogniser  [11],  which  uses  dynamic  programming.  One  or  two 
utterances  per  word  for  training.  Template  memory  size  500  seconds,  maximum  logical 
vocabulary  size  256.  Good  performance  in  noise  is  claimed,  and  good  results  have  been 
reported  in  simulated  helicopter  noise  [12].  The  5000  also  includes  speech  storage  and 
replay. 


4.6 Vecsys RMI88 (Vecsys, France)

A  commercial  version  of  the  MOISE  experimental  system  from  LIMSI  [13].  A  fairly  new 
isolated word recogniser, which uses dynamic programming. Computation power for up to
125  templates.  One  or  two  utterances  per  word  for  training.  Syntactic  selection  of 
vocabulary  subsets.  A  related  system  has  performed  successfully  in  a  jet  fighter 
aircraft.  A  new  version  will  offer  connected  word  recognition  [14]. 


5.0  A  SURVEY  OF  RESEARCH  IN  AUTOMATIC  SPEECH  RECOGNITION 

This  section  provides  brief  references  to  a  selection  of  current  research  efforts  on 
important  topics. 


5.1  Developments  of  whole-word  template  matching  (WWTM) 

Many  groups  are  trying  to  extend  WWTM  methods,  usually  based  on  the  dynamic  time  warp 
(DTW)  application  of  dynamic  programming  (DP). 
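
For readers who have not met it, the core of the DTW comparison can be sketched in a few lines. This is a textbook form with the simplest local path constraints, using scalar features where a real recogniser would compare spectral vectors. It is not the algorithm of any particular equipment described here.

```python
def dtw_distance(template, unknown):
    """Cumulative distance along the best time alignment of two sequences."""
    inf = float("inf")
    n, m = len(template), len(unknown)
    # d[i][j] = best cumulative distance aligning the first i template frames
    # with the first j input frames
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - unknown[j - 1])
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def recognise(templates, unknown):
    """Name of the template with the smallest DTW distance to the input."""
    return min(templates, key=lambda name: dtw_distance(templates[name], unknown))

# Hypothetical one-dimensional 'templates' and a slowly spoken input
templates = {"up": [1, 2, 3], "down": [3, 2, 1]}
print(recognise(templates, [1, 1, 2, 2, 3, 3]))
```

The input here is twice as long as the template it matches, but the alignment absorbs the difference in speaking rate and the distance is zero.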


5.1.1  Connected  words  -  There  are  several  published  algorithms  which  solve  the 
mathematical  problem  of  finding  the  best  sequence  of  whole-word  templates  to  match  a 
given  unknown  speech  pattern  [15,16,17,18,19,20,21].  They  differ  in  amount  of  storage 
and computation needed, their ability to include in-phrase syntax control, the
availability of alternative, sub-optimal 'explanations', and methods for 'pruning' the
search  process  to  reduce  workload  without  significantly  compromising  the  ability  to  find 
the  correct  answer.  The  most  efficient  connected-word  algorithms  need  very  little  more 
computation  than  equivalent  isolated-word  algorithms. 


5.1.2 Many talkers - The favorite method at present is to use several (e.g. 12)
templates per word, chosen in an attempt to cover all pronunciations [22,23].


5.1.3  Operation  with  difficult  signals  -  It  is  worth  distinguishing  between  distorted 
speech,  continuous  background  noise,  and  short  duration  high  amplitude  noises.  Some 
manufacturers  claim  that  their  recognisers  will  work  over  the  telephone  system,  which 
introduces  non-linear  and  linear  distortions.  Many  research  laboratories  are  working 
with  telephone  speech.  Continuous  background  noise  is  a  problem  in  aircraft  and 
elsewhere.  The  three  main  approaches  are:  subtraction  of  the  noise  waveform,  using  a 
second  microphone  and  special  filtering;  subtraction  of  estimated  noise  power  from  the 
input  spectrum;  and  allowing  for  the  presence  of  the  background  noise  when  comparing 
template  and  input  spectra. 
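
The second of these approaches can be sketched as follows. The code assumes that a per-channel estimate of the noise power is available (for instance from frames judged to contain no speech) and applies a small floor so that the result never becomes negative. The floor value and the function name are assumptions for illustration.

```python
def subtract_noise(power_spectrum, noise_estimate, floor=0.01):
    """Subtract the estimated noise power in each channel, never going below
    a small fraction (`floor`) of the original power."""
    return [max(p - n, floor * p) for p, n in zip(power_spectrum, noise_estimate)]

# Hypothetical channel powers for one input frame and the noise estimate
frame = [10.0, 2.0, 5.0]
noise = [4.0, 3.0, 1.0]
print(subtract_noise(frame, noise))
```

In the second channel the estimated noise exceeds the observed power, so the floor takes over. How such channels are treated has a marked effect on the spectrum that is passed to the comparison with the templates.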


5.1.4  Large  vocabularies  -  There  are  many  problems  with  the  use  of  WWTM  for 
vocabularies of more than a few dozen words [24]. Some groups have been most concerned
with the training problem, and have resorted to the use of general-purpose units smaller
than words [25]. Others have worried about the computational workload, and proposed
initial sorting based on gross structure of the word. A more fundamental concern is for
the discrimination power, and methods have been proposed for building into the
'templates' far more information about the variability of different parts of each word
[26,27].


5.2  Largely  automatic,  statistics-based  approaches 

The  IBM  Continuous  Speech  Recognition  team  has  been  the  main  exponent  of  statistically 
based  methods  for  many  years  [28].  Their  goal  is  transcription  of  limited  natural 
language  (e.g.  for  business  letter  dictation).  This  is  different  from  all  other 
applications considered here, where the user is assumed to be prepared to use an
artificial 'language' which is specific to the task. Some of the techniques have been
described by Baker [29,30]. The Verbex 1800 and 3000 systems use word-based statistical
models [31]. Recent work at Bell Labs has applied statistical methods successfully to
speaker-independent isolated digit recognition [32].


5.3  Phonetics  based  approaches 

Many  attempts  have  been  made  to  use  speech  knowledge  in  the  design  of  speech  recognition 
machines,  but  without  much  success.  Reasons  for  this  failure  have  included  lack  of 
adequate  knowledge,  difficulty  in  converting  available  knowledge  into  a  form  useful  in 
speech  recognition,  and  excessive  complexity,  leading  to  difficulty  in  testing  and 
tuning the system. Among groups currently attempting to produce useful recognisers
based on phonetic principles are the National Physical Laboratory, UK, and Thomson-CSF,
France.

A  significant  and  well-equipped  team  at  MIT  is  concentrating  on  identifying  and 
quantifying specific knowledge about how the acoustic characteristics of speech sounds
are  affected  by  their  context,  and  incorporating  this  knowledge  into  recognition 
procedures  [33].  The  goal  is  to  eventually  lift  the  barriers  to  speaker  independence, 
large  vocabularies  and  true  continuous  speech  recognition. 

Other  sources  of  speech-specific  knowledge  are  experimental  psychology  [39]  and 
auditory  neurophysiology  [35]. 


6.0  CONCLUSIONS 

The  whole-word  template  matching  method  is  likely  to  be  the  basis  of  practical  speech 
recognisers  for  some  years,  and  the  cost  and  size  of  recognition  components  which 
perform  as  well  as  the  best  current  systems  can  be  expected  to  fall  dramatically. 

The  best  new  systems  will  combine  features  of  straightforward  pattern  matching  and 
statistical  modelling.  For  applications  which  need  fast,  reliable  data  entry,  possibly 
in  difficult  conditions,  but  with  a  well-defined  task  and  dedicated,  trained  users,  such 
systems  will  be  very  suitable. 

Other  application  areas  will  need  recognisers  that  can  exploit  the  regularities  of 
speech  sound  structure  to  provide  large  vocabularies  and  the  ability  to  adapt  to  the 
speech  characteristics  of  a  new  speaker  in  a  general  way.  It  is  likely  that  a 
combination of pattern matching, statistical modelling and phonetics will be needed.


SUMMARY


Studies on applying speech recognition techniques to the control of functions in combat
aircraft have just entered an active phase. The considerations developed in this text
are based on past or ongoing experiments, in the simulator or in flight.

The main themes developed are the following:

- Technical aspects: sound pick-up, noise and the problems posed by its variability,
the importance of the machine's training phase, and physical constraints.

- Operational aspects, in particular the practical problems posed by integrating
voice command into aircraft cockpits.

- The various experiments carried out in the simulator and in flight, and the
lessons that can be drawn from them.

In conclusion, we attempt to outline the prospects open to this type of technique in
the aeronautical field.

1 - GENERAL PRESENTATION

1-1 Voice dialogue in a combat aircraft

The work undertaken over the past few years on applying voice command and speech
synthesis to the pilot-system dialogue in a combat aircraft has progressed far enough
that using these techniques on the next generation of combat aircraft can now be
envisaged.

The objective is to let the pilot concentrate better on the essentials of the mission
in an increasingly complex operational and technical environment. This goal must be
reached through a new design of the pilot-aircraft dialogue that harmoniously integrates
voice and manual commands. The dialogue must be both richer and more flexible, it must
allow easy control of systems and sensors, and it must give the pilot fast and complete
perception and analysis of tactical situations as well as a good capacity for
anticipation.

However, the use of voice command still runs into several difficulties, both technical
(dependability of operation) and functional (integration into a cockpit), which will
have to be resolved in the coming years.

1-2 Review of the techniques used

The recognition techniques that have so far been the subject of experiments, or of
attempted experiments, in real environments, and more particularly in aeronautics, all
belong to the family of single-speaker recognition methods known as "global" or
"acoustic" word recognition. Their common characteristic is that word recognition is
based on comparing the global pattern of a word that has just been spoken with similar
patterns, called "references", held in memory. This makes it necessary to create these
references during a special phase of use of the machine, called the training phase.
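
A minimal sketch of this whole-word comparison is given below. The paper does not state
the distance measure or the time-alignment method at this point; the sketch assumes
dynamic time warping over per-frame feature vectors, which is consistent with the
"dynamic comparison" mentioned later in the text. All function names and the toy data
are hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames (lists of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(word, reference):
    """Accumulated dynamic-time-warping cost between two frame sequences."""
    n, m = len(word), len(reference)
    inf = float("inf")
    # cost[i][j]: best accumulated cost aligning word[:i] with reference[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(word[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognise(word, references):
    """Return the label of the stored reference closest to the spoken word."""
    return min(references, key=lambda label: dtw_distance(word, references[label]))


# Toy one-dimensional "spectra": one reference per vocabulary word (assumed data).
references = {"mach": [[0.0], [1.0], [2.0]], "altitude": [[5.0], [5.0], [5.0]]}
spoken = [[0.0], [0.0], [1.0], [2.0], [2.0]]  # slower rendition of "mach"
print(recognise(spoken, references))  # prints mach
```

The warping step is what lets a slower or faster rendition of a word still match its
reference, which a frame-by-frame comparison would not tolerate.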

So far, CROUZET has implemented, in the simulator and in flight, an isolated-word
recognition technique in a recognition mode known as "commanded microphone", which uses
an activation push-button on the stick or on the throttle.

The next experiments will, however, use a connected-word recognition technique.


2 - IMPORTANT TECHNICAL ASPECTS


2-1 The different kinds of noise and the problems of sound pick-up
2-1-1 Noise in the cockpit
Cockpit noise has several origins:

- Engine noise, transmitted directly or through rigid structures. A further
distinction can be made between dry regimes and regimes with afterburner.

- Aerodynamic noise, which depends on the speed and altitude of the aircraft; at
high speed and low altitude (Vi > 450 kt) this noise can become predominant over the
other sources.

- Noise due to cockpit pressurisation and air conditioning; this noise is highly
variable and can occasionally be significant.

- Noise from electrical machines: motors, fans, transformers.

In general, the noise perceived in the cockpit of a combat aircraft varies enormously
from one aircraft to another: there are noisy aircraft and quiet aircraft. To make
their aircraft quieter, to improve pilot comfort and above all the quality of radio
transmissions, manufacturers carry out fine spectral and temporal analyses of cockpit
noise.

These detailed analyses are indispensable for identifying the noise sources and treating
them; it is not obvious that the same holds for speech recognition, for three reasons:

- the noise prevailing in the cockpit is not the noise perceived by the mask
microphone,

- the spectral distribution of the noise is broad, and it would be illusory to hope
to act locally,

- the energy of the noise and its spectral distribution are variable enough
(sometimes within a single flight) that it is more profitable to look at its envelope
and its mode of variation than at its fine composition, unless one wishes to act on the
noise sources themselves.

As an example, the highest noise level measured during a ground engine run in the
cockpit of the aircraft that carried our experiment (Mirage III) was 106 dBA. No
measurement was made in flight, but the noise increased rapidly with speed and depended
on altitude (maximum at 540 kt, 10000 ft); the noise occupies the whole spectral band,
with a maximum between 500 and 2500 Hz.

2-1-2 Noise in the oxygen mask

This noise is very different from the noise prevailing in the cockpit, which is damped
by the rubber envelope of the mask; cockpit noise nevertheless remains present.

The most important noise source is linked to the pilot's breathing.

- the sound produced by gas flowing in the confined space of the mask and hoses
and through the valves is significant, and it can also differ between types of mask,

- the microphone is often fixed in the mask facing the pilot's mouth and nose and
directly receives the pilot's exhaled breath.

Breathing noise is particularly troublesome because:

- it appears abruptly, with a high energy level

- it too occupies the speech spectral band, and more particularly the high
frequencies

- it can randomly overlap a word or prolong it.

While a human ear easily distinguishes this type of noise from the speech it is mixed
with, the same is not true of an automatic speech recognition machine, which generally
cannot distinguish between the two types of signal.


2-1-3 Microphones and the problems of sound pick-up

. The microphone capsules themselves are generally satisfactory in terms of dynamic
range and bandwidth (above 5000 Hz), and their response curve is often well suited to
the use case: for example, mask microphones have a response curve boosted at low
frequencies to compensate for attenuation caused by the mask itself.

. Differential microphones reduce sensitivity to extended or distant noise sources.

. The electrical signal delivered by the microphone must be a high-level signal, so
that it is as insensitive as possible to electrical interference.

The nominal characteristics of modern microphones are therefore generally adequate,
but on the other hand:

- there can be unit-to-unit variation in the microphones used and in their mounting
in the mask

- the microphone signal often follows a complicated path through interconnection
systems, intercoms, automatic gain controls, etc., which can distort it and disturb
recognition. It is therefore advisable to establish a direct, well-protected wired link
between the microphone and the speech recognition computer.

. Sound pick-up depends not only on the characteristics of the microphone but also on
the mask and its mounting on the helmet. The relative geometric positions of the
capsule and the mouth must be considered. The microphones currently used in flight, for
example, are placed directly in front of both the pilot's mouth and nose, and are
therefore much more sensitive to the breath produced by exhalation or inhalation than
if they were placed to the side.

2-2 The mixture of noise and speech and its processing
2-2-1 Noise intensity

Satisfactory operation of automatic speech recognition has been obtained in very noisy
environments (reaching 109 dBA, or 115 dBC).

Even in this case, the proximity of the microphone and its directivity make it possible
to work with usable signal-to-noise ratios (about 10 dB), provided that, at least over
short periods (the time taken to pronounce a word), the noise can be considered
stationary.

Noise intensity is therefore not necessarily the only criterion to take into account in
the problem of automatic speech recognition in noise.

2-2-2 Noise variability

A large part of the problem lies in the variability of the noise and of the ways in
which it appears.

There are slow variations in noise, due for example to changes in the speed and
aerodynamic regime of the aircraft; but noise can also appear abruptly in several
circumstances:

- Slipping, under load factor, of a poorly fitted mask. The noise prevailing in the
cockpit can then be picked up much more strongly by the microphone.

- Breathing noise. In the confined space of the oxygen mask and its associated
devices, the gas flow is channelled during inhalation and exhalation and emits a
characteristic noise, which prolongs or overlaps speech. In some cases, when the mask
is not properly seated on the face, the reduced pressure in the cockpit can lead to a
permanent and noisy flow.

In the cases just mentioned, it is not so much the intensity of the noise that is
troublesome as its variability. Even when weak, it can appear to the recognition
machine as a component of the useful speech signal; in general, noise is all the harder
to distinguish within a speech signal when it has no stable characteristics that would
allow a prediction to be made.

Moreover, even to the ear the noise may be hard to distinguish from certain speech
sounds, such as fricatives or sibilants.

2-2-3 Possible solutions

The mixture of noise and speech poses two types of problem for an isolated-word
recognition method:

- detection of the beginning and end of the word

- recognition proper, once the word has been detected.

. Detection of word beginning and end

In general, and especially with non-stationary noise, a simple threshold based on an
energy level will not give good results. Instead, a criterion based on the difference
in nature between speech and noise should be sought.

Very often, even with non-stationary noise, the spectral variability of the noise is
lower than that of speech. Triggering criteria based on this property have given good
results. It nevertheless remains difficult to determine the end of certain words,
especially short words whose pronunciation ends with a strong exhalation that blends
into them and prolongs them. In French this is the case for short words ending in a
sibilant or a fricative, for example: six, neuf...
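
The spectral-variability criterion described above can be sketched as follows. The
paper gives no formula; this sketch assumes that each analysis frame is a list of band
energies, measures variability as the sum of absolute band-by-band differences between
consecutive frames, and uses a hypothetical threshold. It also assumes that the
background is steady on either side of the word, so it does not address the word-final
exhalation problem noted in the text.

```python
def spectral_change(frames):
    """Spectral change of each frame relative to the previous one.

    Element k of the result compares frames[k + 1] with frames[k].
    """
    return [
        sum(abs(cur - prev) for cur, prev in zip(frames[k + 1], frames[k]))
        for k in range(len(frames) - 1)
    ]


def detect_word(frames, threshold):
    """Return (start, end) frame indices, end exclusive, or None if no word."""
    change = spectral_change(frames)
    moving = [k + 1 for k, c in enumerate(change) if c > threshold]
    if not moving:
        return None
    # The first moving frame is the first speech frame; the last moving frame
    # is the first frame back in steady background, hence an exclusive end.
    return moving[0], moving[-1]


# Loud but steady noise (high energy, no spectral change) around a short word.
noise = [10.0, 10.0, 10.0]
word = [[10.0, 14.0, 10.0], [16.0, 10.0, 12.0], [10.0, 15.0, 11.0]]
frames = [noise] * 5 + word + [noise] * 4
print(detect_word(frames, threshold=3.0))  # prints (5, 8)
```

In this toy data the word raises the total frame energy only slightly above the noise
(34 to 38 against 30), so an energy threshold would have little margin, while the
frame-to-frame spectral change separates the two clearly.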

It can be noted, however, that training the speaker, and making the speaker aware of
these problems, helps in mastering elocution.

For the truly difficult cases that remain, the trend would be to look for algorithms
that do not require precise detection of word beginning and end, that is,
connected-word recognition algorithms.

. Recognition proper

Once the detection of word beginning and end has been solved, two cases of noisy speech
signals can be considered: the case where the noise can be considered stationary, and
the case where it cannot.

In the first case, the noise can be considered stationary if its characteristics do not
vary during the whole pronunciation of the word; this applies to noise of aerodynamic
origin, which changes slowly. Good results have been obtained without special processing
for sufficiently high signal-to-noise ratios (cf. 2-2-1). Simple noise-spectrum
subtraction techniques have also been tried successfully.
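
Noise-spectrum subtraction, in its simplest form, can be sketched as below. The paper
does not describe its implementation; the sketch assumes a noise estimate averaged over
frames known to contain no speech, per-band energies as features, and clamping at a
floor to avoid negative energies. All names are hypothetical.

```python
def estimate_noise(noise_frames):
    """Average each spectral band over frames that contain only noise."""
    count = len(noise_frames)
    return [sum(band) / count for band in zip(*noise_frames)]


def subtract_noise(frames, noise, floor=0.0):
    """Subtract the noise estimate from every frame, band by band.

    Results below `floor` are clamped, since a band energy cannot be negative.
    """
    return [[max(e - n, floor) for e, n in zip(frame, noise)] for frame in frames]


# Two noise-only frames, then one noisy speech frame (assumed toy values).
noise = estimate_noise([[2.0, 4.0], [4.0, 4.0]])
cleaned = subtract_noise([[5.0, 3.0]], noise)
print(noise, cleaned)  # prints [3.0, 4.0] [[2.0, 0.0]]
```

The method relies on the stationarity assumption stated in the text: the estimate is
only valid if the noise spectrum during the word resembles the one measured beforehand.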

One important point should be noted: a high ambient sound level distorts the speaker's
voice if the speaker perceives it too strongly (if the protection given by the helmet
is insufficient); the speaker's natural tendency is to restore a roughly constant
signal-to-noise ratio. These distortions quickly become noticeable and disturb the
dynamic comparison. Two precautions should be taken:

1) Check that the speaker's hearing protection is both sufficient and constant.

2) Carry out training under conditions representative of an average case.
In the future, adaptive training may offer more flexible solutions.

The problems of recognising a voice distorted by a noisy environment are fundamental;
they are linked to how representative the parameters extracted during signal analysis
are, and to the quality of the dynamic comparison.

In the second case, recognition is difficult and the progress required is fundamental
in nature: it concerns discriminating between mixed noise and speech, and improving the
dynamic comparison. If speech and noise are not superimposed but juxtaposed in time
(breath noise), it is possible to improve the criteria for selecting the parameters
delivered by the signal analysis.

In the flight trials currently in progress, a substantial proportion (30 %) of the
recognition problems due to noise resulted from errors in detecting word beginning and
end.

2-3 Training the references
2-3-1 The importance of training

In global recognition methods, especially with isolated words, training is of
fundamental importance. The level of performance subsequently achieved depends on how
representative the acoustic patterns acquired during training are, that is, on their
acoustic resemblance to the words as they are actually spoken in flight.

This requirement for similarity is not very demanding as long as speech recognition is
done under laboratory conditions, but it becomes clearly apparent as soon as more
realistic environments are tackled, such as the simulator or flight in a combat
aircraft.

Recall that the speech recognition techniques available today have varied training
procedures, which differ in particular in the number of training passes required (10
being a maximum). The method used by Crouzet requires only a single training pass.

Experience has shown us that:

. The training phase had a somewhat random character (perhaps linked to the fact
that it is done in a single pass)

. Several factors influenced how representative the references produced by the
training phase were. These factors relate to the environment and are acoustic and
ergonomic in nature.

The acoustic factor: we consider it necessary for training to be carried out in an
ambience truly representative of the average noise level of the application, which
means the real ambience; in our case, training is done in the cockpit with the engine
running or, better, in flight.

The ergonomic factors:

- reciting a list of words quickly becomes tedious, and the naturalness of
elocution suffers, especially if the list is long. From this point of view, the
possibility of a single training pass is an advantage.

- Training must be carried out with the same equipment and in the same ambience as
those of later use.

As an example illustrating the importance of these factors, the percentage of words
recognised in flight improved significantly (+ 5 %) from the moment training was no
longer done on the ground with the engine running, but directly in flight.

2-3-2 Evolving training

For several reasons, it would be desirable to let the initial set of references
produced by training evolve.

- Some of these references may prove mediocre in use.

- As speakers gain practice, they often come to speak differently from the way they
spoke during training.

- In use, the environmental conditions, especially noise, may change relative to
those of training.

To obtain a high level of performance, it may therefore prove necessary:

- either to touch up the initial set of references

- or to recreate a set of references from words spoken in flight during recognition
phases, which requires keeping them in memory

- or to use automatic procedures to make the references evolve and adapt them
continuously to changes in the environment or in the pilot's voice. Such procedures are
currently under study.
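
As one possible illustration of the third option (automatic adaptation), the sketch
below nudges a stored reference towards each utterance that has been recognised and
accepted. The paper says only that such procedures are under study, so the
exponential-averaging rule, the `rate` parameter and the assumption that utterance and
reference have already been time-aligned to the same number of frames are all
hypothetical.

```python
def adapt_reference(reference, utterance, rate=0.1):
    """Move a stored reference a fraction `rate` of the way towards an utterance.

    Both arguments are lists of frames (lists of floats) of identical shape.
    rate=0 leaves the reference unchanged; rate=1 replaces it by the utterance.
    """
    if len(reference) != len(utterance):
        raise ValueError("reference and utterance must have the same frame count")
    return [
        [(1.0 - rate) * r + rate * u for r, u in zip(ref_frame, utt_frame)]
        for ref_frame, utt_frame in zip(reference, utterance)
    ]


reference = [[0.0, 10.0]]
in_flight = [[10.0, 0.0]]
print(adapt_reference(reference, in_flight, rate=0.5))  # prints [[5.0, 5.0]]
```

A small rate makes the references follow slow drifts in voice or noise while limiting
the damage done by an occasional misrecognised word.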

2-4 Physical and physiological constraints

2-4-1 Accelerations

The effects of accelerations on speech recognition have been studied over a presumed
useful range extending to about 4.5 g. This limit was chosen for two reasons:

- it is difficult on the Mirage III to sustain continuous turns at steeper bank
angles, yet the study requires a sufficient number of words to be spoken under
relatively constant load factors.

- Higher load factors correspond to special manoeuvres, generally combat
manoeuvres.

In combat phases, all the controls the pilot needs are grouped under the fingers, on
the stick grip and the throttle. Only the call-up of a few particular parameters would
be warranted at high load factors.

Schematically, the problems resulting from large vertical accelerations (entering
turns) are of two types:

- Breathing problems.

- Noise problems.

Accelerations affect the respiratory cycle; the pilot no longer breathes normally, but
inhales before speaking, tenses up and speaks in a manner that is often fast and jerky.
Initial plosives are more marked; the breath of sustained exhalation can overlap the
end of words to a greater extent. However, up to 4 g these effects are still weak and
barely detectable, even by ear, and it seems possible to go further.

The increase in noise, on the other hand, is quite noticeable; its main cause is
slipping of the mask (see § 2-2-2), the secondary effects being due to changes in the
aerodynamic noise in the cockpit (increased angle of attack).

At present, under load factors of up to 4 g, we obtain recognition rates about 5 %
lower than those obtained in level flight.

It should be noted that recognition under load factor can be influenced by a wide
spread in pilots' behaviour under these conditions.

2-4-2 Other factors

We have not seriously analysed any influencing factors other than accelerations, apart
from altitude and Mach number, which do not appear to produce appreciable effects.

One influential factor could be fear, but there is no a priori reason to think that
fear would affect elocution any more than it affects reasoning, motor skills, manual
dexterity, etc. If using voice command makes it possible to simplify, rationalise and
clarify command procedures, and, through a richer dialogue, to allow better analysis
and better anticipation of tactical situations, why not think that, on the contrary,
it would provide additional help?

3 - OPERATIONAL ASPECTS

3-1 The gain

The various aspects of the expected gain were mentioned in the introduction; let us
recall the main ones:

. Better concentration of the pilot on the short-term task, particularly when the
visual field is frequently demanded by the outside world (combat, ground attack,
approach, close formation...) or by electronic displays.

. Increased safety in difficult flying where outside vision is essential
(low-altitude flight, attack phases, formation flying).

. Increased interactivity between the pilot and the on-board systems.

. Time saving and greater ease in executing command procedures.

. More options for cockpit layout, notably where steeply reclined seats are
envisaged.

. Fewer manual controls.

However, these gains can only be obtained, and full advantage taken of voice command,
once it can be properly integrated into the cockpit.

3-2 Integrating voice command into a combat aircraft cockpit

Voice command is essentially intended for aircraft whose cockpit has yet to be
designed; it is illusory to expect a significant gain in a cockpit equipped too
traditionally. Several considerations come into play:

3-2-1 Parallelism of voice and manual procedures

Since voice command is not absolutely reliable, it cannot be the only means of
commanding a function; every voice command must therefore also be achievable manually.
As the manual controls are then used less frequently, it is possible to increase their
centralisation and degree of multiplexing and to reduce the number of specialised
control stations; the two means of command are then on an equal footing, and they are
used in similar ways. In this context, the manual controls are not backups for the
voice commands; rather, each command procedure can be carried out by voice or manually.
Moreover, a command sequence can be started by voice and continued manually, or the
reverse.

3-2-2 Mechanical compatibility

The mechanical devices for commands that can be made either manually or by voice cannot
be of just any kind; latching push-buttons with two stable mechanical positions, toggle
switches and rotary switches are excluded; usable devices must not have several stable
mechanical positions; momentary illuminated keys, incremental controls, etc. will be
used.
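
The reason for excluding controls with several stable positions can be illustrated with
a small sketch: if the state of a function is held in software and merely shown by a
lamp, a voice command and a key press can change it interchangeably, whereas a latching
switch would be left in a position that contradicts the state after a voice command.
The class and function names are hypothetical and not taken from the paper.

```python
class SharedFunction:
    """On/off aircraft function whose state lives in software, not in a switch."""

    def __init__(self, name):
        self.name = name
        self.active = False

    def set(self, active, source):
        """Set the state and report which input channel changed it."""
        self.active = active
        return f"{self.name} {'ON' if active else 'OFF'} (by {source})"

    @property
    def lamp_lit(self):
        # The key's lamp only displays the state; the key has no stable position.
        return self.active


def key_press(function):
    """A momentary key toggles the current state."""
    return function.set(not function.active, "key")


def voice_command(function, word):
    """A recognised word 'on' or 'off' sets the state directly."""
    return function.set(word == "on", "voice")


radar = SharedFunction("radar")
print(voice_command(radar, "on"))  # prints radar ON (by voice)
print(key_press(radar))            # prints radar OFF (by key)
```

Because neither input channel owns the state, a sequence can be started by voice and
finished manually, as section 3-2-1 requires.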

3-2-3 Dialogue flexibility

The simulator and flight experiments showed us that voice command procedures must be
carefully worked out to avoid operating difficulties. We drew several lessons from
them, for example:

- Connected-word recognition is highly desirable. It makes it possible:

. to come closer to natural elocution

. to carry out complex commands (commands involving numerical data entry,
for example, are long and difficult to carry out with isolated words)

. to extend the capabilities of voice command.

- Achieving a good level of safety requires the use of a microphone-opening
button.

- In the combined use of the microphone-opening button and the command phase, the
speaker's hesitations, silences and delays must be tolerated without problems.


3-3-1 On-board training

Obtaining good performance currently seems to us to require that training be done on
board and, if possible, in flight. This is a constraint, but it is likely that this
training will be done during training or conversion flights. Moreover, the training
carried out by the pilot will be valid for all aircraft of the same type that the pilot
is likely to fly.

3-3-2 The storage medium for the references

The references produced by training are attached to the pilot and must be storable on
some medium (electronic memory module, magnetic badge, cassette, ...) which is personal
to the pilot and which must be easy to carry and to insert before each flight into a
reader on the aircraft; it must be possible to obtain a copy quickly if it is lost.

Managing such an item entails constraints that are not negligible, but the means it
requires are not large compared with those that mission preparation will require in the
future.

3-4 The functions

The functions on a combat aircraft in which it will be worthwhile to include voice
procedures are very numerous, but nothing precise or definitive can be put forward
today.

Let us simply cite, as examples, functions that may be concerned:

- communication

- identification

- navigation

- weapon preparation

- sensor management (in particular in the Air-to-Air and Air-to-Ground attack
modes)

- display management

- mode changes

- piloting instructions

- parameter interrogation

- etc...

4 - THE AIRBORNE EXPERIMENT "EVA"

4-1 Experimental objectives

The flight trials of an isolated-word recognition system followed a simulator study in
which the operational aspect had been addressed.

The purpose of the EVA experiment (Equipement Vocal des Avions, aircraft voice
equipment) was to develop and validate the isolated-word recognition technique in the
real environment.

4-2 Description of EVA
4-2-1 Airborne equipment

The equipment carried in the nose of Mirage III R No. 306 of the Centre d'Essais en Vol
at Bretigny consists of:

- A voice dialogue unit (1/2 ATR short); it performs isolated-word recognition and
the synthesis of messages and failure warnings.

- A Qantex digital cassette reader/recorder. These cassettes contain:

. the pilot's references produced by training

. the digital form of all words spoken in flight (sonagrams)

. 3 flight parameters (cabin altitude, load factor, roll)

- The Qantex power supply

- A tape recorder.

In the cockpit, a control and display unit located at the top of the instrument panel
directs the course of all voice dialogue procedures and provides control over them. It
allows:

- powering up EVA

- stopping the synthesis

- choosing the mode, training or recognition

- display, on a liquid crystal screen (2 lines of 20 characters), of recognition
results and of various information related to the dialogue.

4-2-2 Ground bench

A dedicated bench allows the following to be carried out on site:

- preparation of the cassettes

- the initial analysis and print-out of a test flight.

4-3 Functions involved on the aircraft

Developing isolated-word recognition in a real environment requires realistic control
loops; they are essential to the speaker's motivation.

The functions involved are as follows:

Parameter call-up: the pilot can ask by voice for the value of certain parameters; the
value is then simultaneously displayed on the control and display unit and synthesised
in the headset.

- Mach number

- altitude

- angle of attack and roll

- true airspeed and indicated airspeed

- distance and bearing relative to a selected beacon

- load factor

- fuel remaining.

UHF radio frequency selection: radio frequencies can be:

- called up by a code name

- composed (digit by digit)

- put into service.

Autopilot (flight path stabiliser): it is possible:

- to engage and disengage the flight path stabiliser

- to engage and disengage altitude hold.

Failure synthesis: 10 failures are covered by synthesis; each failure message can be
activated individually by a switch on a dedicated control panel.

State-change synthesis:

- Landing gear down and locked

- Fuel tank transfer

- Autopilot disengagement.

The vocabulary needed to perform these functions consists of 22 words, plus the 10
digits.


4-4 Conduct of the tests

4-4-1 Coordination of tests and studies

EVA is highly experimental in character, and the flight tests interact closely with the
laboratory studies.

Developing isolated-word recognition in the cockpit environment uses two levels of
analysis of the results:

- on site, using the ground bench (first diagnosis)

- in the company's laboratories at Valence.

The data acquired in flight are kept there in full; they are used in simulation to
develop a technique, to study modifications to it, and to improve performance.

4-4-2 Conduct of the tests

The tests covered about 40 flights, with two pilots.

The following stages can roughly be distinguished:

- Exploratory flights with a restricted vocabulary and flight envelope.

- Development flights (development of the method, of the functional dialogue, and of
the vocabulary).

- Flights opening up the flight envelope.

- Measurement flights.

Given the continual changes to the experiment during development, for most flights
training was carried out in the aircraft, either before the flight or at the very start
of it. Only the measurement flights were made without training, using references stored
on a cassette.

4-5 Results

The pilots judged the results obtained to be satisfactory. Several problems, notably of
acquisition, had to be solved during the first phase of the tests and when the flight
envelope was opened up.

The evolution of the performance obtained can be represented by the figure below:

[Figure: % recognition]


This figure shows the mean envelope of the flight test results. The generally rising
trend is typical of improvement over time, due to accumulated experience and continual
development.

The scatter within the envelope has yet to be analysed in detail. Human factors
probably account for a large part of it.

Note the sudden deterioration in the results when the flight envelope was opened up
towards high load factors and high speeds.

The upward trend of the curve gives us hope of further gains in the results.

5 - CONCLUSION

The prospects opened up by the behaviour of speech recognition in the simulator and in
flight are very interesting and allow the first in-flight applications to be envisaged.
It is nevertheless necessary to work actively in at least two directions:

- Mastering the reliability of the dialogue.

. The results already obtained in flight are very encouraging and show that it is
probably possible to go further and obtain even higher scores.

. This mastery can only be achieved through intensive flight experiments, with as many
pilots as possible, in order:

- to understand better the diversity of the problems that arise, even if they appear
to be of minor importance,

- to characterise more closely the variability of the speech signal and of the noise,

- to allow continuous development of the recognition techniques.

Indeed, automatic speech recognition is one of the fields in which it is most difficult
to reproduce and study in the laboratory the conditions of use in a real environment.

- Integrating voice commands into aircraft cockpits.

This aspect requires a thorough understanding of the nature and possibilities of voice
dialogue in order to determine its role in a cockpit. The operational gain must be
clearly demonstrated; this probably calls for a new, overall design of the controls.

It is likely that voice command will at first be given a role limited to a few functions
(radio communications and radio navigation, for example), and that this role will later
be extended to other functions in radically new cockpits.

If this work continues at a satisfactory pace, the first operational applications should
emerge at the end of the 1980s.
1.0  SUMMARY


The  aim  of  this  lecture  is  to  illustrate  some  of  the  speech  recognition  techniques 
that  have  been  presented  so  far,  by  concentrating  on  a  particular  speech  recognition 
system  that  the  author  knows  well.  This  system,  known  as  Logos  (from  the  Greek  for 
"word"),  is  designed  as  a  flexible,  high-performance,  experimental  machine  for  research 
on  recognition  methods  and  applications  aspects.  However,  it  can  serve  as  a  point  of 
reference  when  considering  more  practical  machines,  both  current  and  future. 

We  first  present  the  algorithms  for  connected  word  recognition  on  which  the  system 
is  based.  We  then  consider  some  of  the  practical  matters  that  can  be  important  in 
implementing  such  algorithms  in  computer  programs  and  special-purpose  equipment. 
Finally,  we  give  an  overview  of  the  hardware  system  architecture,  pointing  out  ways  that 
the  properties  of  the  algorithms  have  influenced  the  design. 


2.0  SYMBOLS  USED 


M            Number of frames in the input pattern.

R            Number of templates in use.

N(r)         Number of frames in the r'th template.

dist(t,i,j)  Spectrum distance between the i'th frame of the t'th template and the
             j'th input frame.

C(t,i,j)     Sum of spectrum distances for the best explanation of the first j input
             frames, leading to the i'th frame of the t'th template.

T(j)         Identity of the last template in the best explanation of the first j
             input frames.

F(j)         Input frame at the end of the template preceding T(j) in the best
             explanation of the first j input frames.

L(t,i,j)     Word link - number of the input frame corresponding to the end of the
             template preceding the t'th template.


3.0  THE  CONNECTED  WORD  RECOGNITION  ALGORITHMS 


The  obvious  way  to  apply  whole  word  template  matching  techniques  to  connected  word 
recognition is to segment the input into words, then determine the identity of each
word. Because segmentation is so difficult, the alternative approach we favour is to
extend whole word pattern matching to deal with connected words in a natural way,
without the need for a prior segmentation into words. We define the best word sequence
for a given input utterance as the one for which the corresponding templates, when
joined together end-to-end, make a "composite template" which matches the input pattern
best (i.e. better than any other template sequence). This is illustrated in Fig. 1,
where the sound pattern of the connected sequence "one-six-three-five-two" is displayed
above  a  concatenation  of  the  corresponding  isolated  digit  patterns.  Connected  word 
recognition  using  whole  word  pattern  matching  techniques  will  be  possible  if  the  best 
word  sequence  corresponds  to  the  actual  words  spoken  often  enough  to  be  useful  and  if 
there  also  exists  an  efficient  algorithm  for  finding  the  best  word  sequence. 

We  describe  an  efficient  one-pass  dynamic  programming  algorithm  to  find  the 
sequence  of  templates  which  best  matches  the  whole  of  the  unknown  input  pattern  [1].  We 
find  that  this  approach  works  quite  well,  considering  its  naivety. 

In  the  following  description  of  the  connected  word  recognition  algorithm,  it  is 
assumed  that  the  input  and  the  template  patterns  are  represented  as  sequences  of  units 
which  are  called  "frames"  (from  vocoder  terminology).  In  most  of  our  work,  each  of 
these  frames  has  been  a  19-point  spectrum  cross-section.  Each  frame  in  Fig.  1 
corresponds  to  20ms  of  speech  signal. 


The  main  requirement  for  the  matching  algorithm  is  that  any  frame  of  the  input  can 
be  compared  with  any  frame  of  any  template  to  give  a  measure  of  "distance"  or 
"dissimilarity"  between  the  two  frames.  Some  distances  for  a  simple  example  are  plotted 
in Fig. 2. The input word sequence "one-three-two" is plotted with time axis horizontal
and templates for "one", "two" and "three" are plotted with time axis vertical. A
distance of zero between two frames (i.e. identical data) is displayed as the largest
size of black dot, and a range of other distances is displayed by using smaller dots.
Outside  this  range  the  display  is  white. 


3.1  Dynamic  Programming  Scoring  for  Connected  Word  Recognition 


In  isolated  word  recognition  a  "time  registration  path"  maps  the  timescale  of  the 
input  on  to  the  timescale  of  a  single  template.  In  connected  word  recognition,  a  time 
registration  path  maps  the  timescale  of  the  input  on  to  the  timescale  of  a  sequence  of 
templates.  This  is  illustrated  in  Fig.  3  for  a  particular  sequence  of  templates.  This 
diagram can be re-drawn as in Fig. 4, where it can be seen that any other template
sequence could also be accommodated. In Fig. 4 any valid path starts at one of the
points  A,B,C  (the  start  of  the  input  and  the  start  of  one  of  the  templates)  and  ends  at 
one  of  the  points  X,Y,Z  (the  end  of  the  input  and  the  end  of  one  of  the  templates). 

Within  templates  the  time  registration  path  can  repeat  or  skip  template  frames,  but 
transitions  between  templates  are  always  to  the  start  of  one  template  from  the  ends  of 
those  templates  which  are  permitted  to  precede  it. 

Scoring  an  arbitrary  connected-template  path  is  very  similar  to  the  isolated-word 
case: a simple sum is formed of all the between-frame distances along the path. The
best  explanation  of  the  input  corresponds  to  the  best-scoring  path. 

The connected word recognition score, C(t,i,j), is defined as the sum of the
distances for the best way of matching the first j input frames with any permissible
sequence of templates followed by the first i frames of the t'th template (Fig. 5).
Thus although the score, C(t,i,j), is still only a function of position (j) in the input
and  of  position  (i)  in  a  template  (t),  as  for  isolated  word  recognition,  it  also  depends 
on  the  other  templates  and  the  way  they  might  explain  previous  parts  of  the  input. 

Within each template the same basic operation is performed as in isolated word
recognition:

\[ C(t,i,j) = \min_{a \in \{0,1,2\}} C(t,i-a,j-1) + \mathrm{dist}(t,i,j) \tag{1} \]

where dist(t,i,j) is the spectrum distance between the i'th frame of the t'th template
and the j'th input frame.

At the start of each template, the ends of the preceding permissible templates must
be examined and the best score selected:

\[ C(t,1,j) = \min_{1 \le r \le R} C(r,N(r),j-1) + \mathrm{dist}(t,1,j) \tag{2} \]

where R is the number of templates.

Computation  proceeds  for  all  templates  in  parallel  in  one  forward  pass  through  the 
input pattern. At the end of the input, the score for the best interpretation can be
found  by  examining  the  scores  at  the  ends  of  all  the  templates  that  are  permitted  to  end 
an  utterance. 

Fig.  6  shows  some  dynamic  programming  scores  computed  from  the  spectrum  distance 
data for the simple example displayed in Fig. 2. The best score for each input frame is
displayed  as  the  largest  black  dot,  and  the  other  scores  for  the  same  input  frame  are 
displayed  relative  to  the  best  by  using  smaller  dots. 
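
As an illustration of equations (1) and (2), the following Python sketch computes the score of the best interpretation in one forward pass. It is only a minimal sketch under stated assumptions: the function name, the 0-based indexing and the nested-list layout of dist are ours and are not taken from Logos, any template is allowed to start or end the utterance, and a path may either stay in the first frame of a template or enter it from the end of any template.

```python
import math

def connected_word_score(dist, n_frames):
    """Score of the best template sequence explaining all n_frames input frames.

    dist[t][i][j] is the distance between frame i of template t and input
    frame j (0-based here, unlike the 1-based indices used in the text).
    """
    lengths = [len(d) for d in dist]
    # prev[t][i] holds C(t, i, j-1); nothing is explained before the first frame.
    prev = [[math.inf] * n for n in lengths]
    for j in range(n_frames):
        if j == 0:
            entry = 0.0                      # a path may start in any template
        else:
            # equation (2): best score at the end of any template, one frame earlier
            entry = min(prev[r][-1] for r in range(len(dist)))
        cur = []
        for t, n in enumerate(lengths):
            col = []
            for i in range(n):
                # equation (1): repeat (a=0), advance (a=1) or skip (a=2) a template frame
                best = min(prev[t][i - a] for a in (0, 1, 2) if i - a >= 0)
                if i == 0:
                    # assumption: staying in frame 0 competes with entering the template
                    best = min(best, entry)
                col.append(best + dist[t][i][j])
            cur.append(col)
        prev = cur
    # examine the ends of all templates after the last input frame
    return min(col[-1] for col in prev)
```

With one-dimensional "spectra", two templates [0, 0] and [5, 5] and the input [0, 0, 5, 5], the best interpretation is the first template followed by the second, with a total distance of zero.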


3.2  Recording  and  Using  the  Word  Sequence  Information 


The above algorithm finds the score for the best time alignment of the best
sequence of templates to explain an unknown connected word utterance. However, we are
far more interested in the actual sequence of templates which produced this best score,
so during the main pass over the input we must also keep track of the word decisions
along  all  the  current  time  registration  paths,  and  then  trace  them  back  at  the  end  of 
the  phrase.  The  information  about  these  word  decisions  forms  a  tree  structure  which 
"grows  new  branches"  as  the  unknown  input  is  processed.  Fig.  7  shows  a  simple  example 
of  such  a  "word  decision  tree"  which  will  be  used  below.  The  path  between  points  D  and 
A,  for  example,  corresponds  to  the  sequence  of  templates  TU ,  t8 .  There  are  many  paths 
in the tree, some of which have ended because better-scoring paths have been chosen in
the computation of (1) and (2). The tree is currently being extended only at points A,
B and C. When the end of the input is reached, the best path and the corresponding best
word sequence can be traced back through the word decision tree. Methods for creating
this tree structure and recovering the best word sequence are described below.


3.3  No  Syntax 


Vintsyuk  [2]  suggested  the  following  method  of  recording  template  sequence 
decisions  for  the  simple  case  in  which  any  of  the  R  templates  can  follow  any  other.  The 
method  needs  three  arrays  which  we  shall  call  T,  F  and  L.  The  best-scoring  template 
ending at the j'th input frame is defined as

\[ T(j) = \operatorname*{ArgMin}_{1 \le r \le R} C(r,N(r),j) \tag{3} \]

where  ArgMin  means  the  value  of  the  index  which  minimises  the  expression.  The  second 
array,  F(j),  records  the  input  frame  corresponding  to  the  last  frame  of  the  template 
which  precedes  T(j).  The  data  structure  which  corresponds  to  the  word  decision  tree  in 
Fig. 7 is illustrated for the Vintsyuk method in Fig. 8. (The j'th input frame has just
been  processed.) 

The  values  in  the  arrays  F  and  T  will  be  sufficient  to  recover  the  best  word 
sequence.  After  processing  the  last  frame  of  an  input  phrase  of  length  M  frames,  the 
final  template  in  the  best  sequence  is  T(M),  the  last  but  one  template  is  T(F(M)), 
preceded  by  T(F(F(M))),  and  so  on  until  the  template  corresponding  to  the  beginning  of 
the  input  is  reached. 

In  order  to  fill  the  array  F,  we  need  the  third  array,  L,  which  is  used  to  hold 
"word  links".  L(t,i,j)  holds  the  number  of  the  input  frame  corresponding  to  the  end  of 
the  template  previous  to  the  t'th  template,  determined  along  the  best  time  registration 
path up to the point (t,i,j). In the example in Fig. 9, L(2,i,j) holds the value j2,
which is the input frame at which the previous template ended, along the best path to
(2,i,j). F(j2) holds the value j1, and T(j2) holds the template number T3.

The word link information in L propagates with the scores, so that

\[ L(t,1,j) = j-1 \tag{4} \]

\[ L(t,i,j) = L(t,i-a,j-1) \tag{5} \]


where  a  is  the  index  chosen  in  (1). 

For  each  input  frame,  the  corresponding  entry  in  the  array  F  can  now  be  made,  using 
the  value  of  L  from  the  end  of  the  best  scoring  template: 

\[ F(j) = L(T(j), N(T(j)), j) \tag{6} \]

As with the scores, it is only necessary to store the values of L for the current input
frame.
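
The bookkeeping of equations (3) to (6) and the final traceback can be sketched in Python as follows. This is a hedged illustration, not the Logos implementation: the function name, the 0-based indices and the use of -1 to mean "before the start of the input" are our assumptions, and ties are broken in favour of starting a new word.

```python
import math

def recognise(dist, n_frames):
    """Return the best template sequence as a list of template indices.

    dist[t][i][j]: distance between frame i of template t and input frame j.
    """
    R = len(dist)
    lengths = [len(d) for d in dist]
    C = [[math.inf] * n for n in lengths]   # scores for the previous input frame
    L = [[-1] * n for n in lengths]         # word links for the previous input frame
    T, F = [], []                           # one entry per input frame
    for j in range(n_frames):
        entry = 0.0 if j == 0 else min(C[r][-1] for r in range(R))
        new_C, new_L = [], []
        for t in range(R):
            c_col, l_col = [], []
            for i in range(lengths[t]):
                # equation (1) picks a; equation (5): the link follows that choice
                best, link = math.inf, -1
                for a in (0, 1, 2):
                    if i - a >= 0 and C[t][i - a] < best:
                        best, link = C[t][i - a], L[t][i - a]
                if i == 0 and entry <= best:
                    best, link = entry, j - 1          # equations (2) and (4)
                c_col.append(best + dist[t][i][j])
                l_col.append(link)
            new_C.append(c_col)
            new_L.append(l_col)
        C, L = new_C, new_L
        best_t = min(range(R), key=lambda r: C[r][-1])  # equation (3)
        T.append(best_t)
        F.append(L[best_t][-1])                          # equation (6)
    # Trace back through T(M), T(F(M)), T(F(F(M))), ...
    words, j = [], n_frames - 1
    while j >= 0:
        words.append(T[j])
        j = F[j]
    return words[::-1]
```

Only one column of C and L per template is kept, in line with the remark that the values need be stored for the current input frame only; T and F grow by one entry per input frame.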


3.4  Using  Syntax


In  many  applications  of  speech  recognition  there  is  some  knowledge  of  the 
permissible  order  of  speaking  the  vocabulary  words.  For  instance,  in  data  entry  to 
computers  the  words  must  normally  be  spoken  in  a  certain  order  (e.g.  a  control  word 
followed  by  some  data)  so  that  the  computer  can  process  the  information. 

The  network  in  Fig.  10  shows  an  example  of  a  set  of  word  sequence  rules  (i.e.  a 
syntax)  that  might  be  applicable  to  a  subtask  in  an  aircraft  cockpit.  All  utterances 
accepted by the syntax correspond to routes through the network, e.g. "Waypoint 4",
"Height 17434" and "Radio frequency 37.25". The "silence" template is used to explain
the  input  pattern  while  there  is  no  speech,  and  the  "reject"  template  deals  with 
utterances  and  noises  which  do  not  fit  the  rules.  (These  techniques  will  be  explained 
in  Sections  4.1  and  4.2.) 

We  could  specify  an  equivalent  syntax  (called  AIRSYN)  in  the  following  form: 

DIGIT  = { 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 0 }

WPC    = { waypoint / channel } DIGIT

RF     = radio frequency < DIGIT > { decimal < DIGIT > / }

HT     = height < DIGIT >

AIRSYN = < < SILENCE > { REJECT / WPC / RF / HT } >

Including such word sequence information in the recognition algorithm has several
advantages. It prevents the input from being recognised as a phrase which is
just nonsense and, by computing scores only for those sequences of words which "make
sense", the amount of computation can be significantly reduced. Of course, if nothing
is  known  about  the  possible  order  of  the  words,  the  syntax  must  simply  allow  that  any 
word  can  follow  any  other  word. 
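
To make the AIRSYN rules concrete, the sketch below checks word sequences against them at the word level, using a Python regular expression. It rests on our reading of the notation, which the text does not spell out: braces with slashes are taken to list alternatives and angle brackets to mean one or more repetitions. The silence and reject templates are left out, "radio frequency" is written as two tokens, and the names are hypothetical.

```python
import re

# Each word is followed by one space so that whole words are matched.
DIGIT = r"(?:[0-9] )"
WPC = rf"(?:(?:waypoint|channel) {DIGIT})"
RF = rf"(?:radio frequency {DIGIT}+(?:decimal {DIGIT}+)?)"
HT = rf"(?:height {DIGIT}+)"
PHRASE = re.compile(rf"(?:{WPC}|{RF}|{HT})")

def accepted(words):
    """True if the word list is one phrase allowed by the assumed AIRSYN rules."""
    text = "".join(word + " " for word in words)
    return PHRASE.fullmatch(text) is not None
```

For example, accepted("waypoint 4".split()) holds, whereas "waypoint 4 2" is refused because WPC allows a single digit. In the recogniser itself the same rules would act inside the dynamic programming, by limiting which template ends are examined at the start of each template.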
The method of recording and recovering word decisions can easily be extended to the
case  of  finite  state  syntaxes  (i.e.  those  that  can  be  drawn  as  a  directed  graph,  with 
templates  in  the  arcs).  Junctions  between  templates  are  called  "nodes".  Vintsyuk's 
algorithm  deals  with  a  "syntax"  containing  only  one  node.  When  there  is  more  than  one 
node  we  need  to  record  the  template  selection  decisions  made  at  each  node.  More 
information about the JSRU method can be found in Bridle et al. [3].


3.5  Continuous  Operation 


The  connected  word  recognition  algorithm  described  above  is  suitable  for 
recognising  discrete  utterances  when  augmented  with  an  algorithm  for  determining  the 
beginning  and  end  of  a  spoken  phrase.  However,  it  is  not  difficult  to  extend  the 
algorithm  to  operate  continuously.  This  can  avoid  the  need  for  explicit  endpoint 
detection  (see  Section  4.1),  and  it  allows  the  output  of  word  decisions  while  the  talker 
is  still  speaking. 

The contents of array L for the current input frame define those places where the
word  decision  tree  is  being  extended.  In  Fig.  11,  this  would  be  at  points  A,  B  and  C. 
The  paths  back  from  these  points  converge  at  point  D.  Thus  no  further  input  can  change 
the  decision  that  the  first  two  words  are  T5  and  TO,  and  this  decision  can  therefore  be 
output,  even  before  the  end  of  the  utterance. 
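
This partial traceback can be sketched as follows, assuming the arrays T and F of Section 3.3 are held as Python lists and that a link of -1 marks the start of the input (both conventions, and the function name, are ours). Each surviving path is followed back through F, and the words shared by every path are the ones that can be output.

```python
def decided_words(active_links, T, F):
    """Words that no further input can change.

    active_links: the values currently held in L, i.e. for each surviving
    path the input frame at which its previous word ended (-1 for none).
    T[j], F[j]: best template ending at frame j and the frame at which the
    template before it ended.
    """
    def chain(j):
        frames = []
        while j >= 0:
            frames.append(j)
            j = F[j]
        return frames[::-1]          # earliest word ending first

    chains = [chain(j) for j in set(active_links)]
    common = []
    for frames in zip(*chains):      # stops at the shortest chain
        if len(set(frames)) != 1:
            break                    # the paths diverge from here on
        common.append(frames[0])
    return [T[j] for j in common]
```

Because every frame has a single predecessor in F, the chains form a tree and share a common initial portion, which is what the loop extracts.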


4.0  PRACTICAL  CONSIDERATIONS 


This  section  records  a  variety  of  techniques  and  considerations  that  arose  while 
designing  computer  programs  and  specifying  a  real-time  hardware  implementation. 

4.1  Detection  of  Utterance  Endpoints 


One  of  the  most  difficult  problems  in  isolated  word  recognition  is  determining 
where  the  word  starts  and  finishes  ("endpoint  detection"),  because  this  process  usually 
precedes  pattern  matching.  It  has  been  suggested  that  most  errors  made  by  speech 
recognition  machines  are  basically  endpoint  errors.  Our  technique  avoids  deciding  the 
positions  of  the  boundaries  between  words  before  deciding  the  identity  of  the  words,  and 
the  same  method  can  also  be  used  to  deal  with  the  utterance  endpoint  problem.  In 
"isolated  utterance"  mode  we  would  use  syntaxes  which  start  and  end  with  "silence 
templates",  which  are  simply  examples  of  the  background  noise  spectrum.  In  principle, 
it is then only necessary to find the start point very roughly, to move back far enough
to  include  some  silence,  and  to  start  recognition  using  the  initial  silence  template. 
The  final  silence  template  can  similarly  be  used  to  deal  with  the  end  of  the  utterance. 

In  practice,  our  real-time  machine  is  normally  run  in  continuous  recognition  mode, 
using a syntax which includes one or more silence templates in loops for explaining the
input  pattern  between  the  spoken  utterances.  In  Fig.  10,  for  instance,  the  "silence" 
template  will  explain  the  pauses  between  phrases,  but  extra  silence  templates  would  be 
used  if  we  expected  the  user  to  pause  between  words  in  the  phrase. 


4.2  Non-Vocabulary  Words 


It  is  possible  to  make  provision  for  the  speaker  uttering  words  that  are  not  in  the 
vocabulary.  In  isolated  word  recognisers,  there  is  usually  provision  for  rejecting  an 
utterance  for  which  the  scores  of  all  the  templates  are  worse  than  some  preset 
threshold.  For  connected  word  recognition  the  corresponding  operation  is  to  reject  a 
portion  of  an  utterance  while  accepting  the  rest.  Our  method  is  to  have  very  elastic 
"pseudo-templates"  which  always  produce  the  same  "distance"  when  matched  with  any  input 
frame.  By  incorporating  these  "wildcard"  templates  into  the  syntax,  spurious  inputs, 
such  as  breath  noises  and  unknown  or  out-of-context  words,  can  be  matched  (and  rejected) 
at  selected  points  in  the  syntax.  The  syntax  in  Fig.  10,  for  instance,  can  reject 
complete  spurious  utterances,  but  extra  wildcards  would  be  needed  to  deal  with  the 
spurious  sounds  which  could  occur  between  the  words  within  a  phrase.  The  distances  for 
wildcard  templates  have  to  be  chosen  carefully  to  make  it  unlikely  that  a  wildcard  will 
be  selected  when  a  word  in  the  permitted  vocabulary  is  spoken.  If  a  wildcard  is  chosen 
in  preference  to  the  correct  vocabulary  word,  it  implies  that  the  word  has  been  spoken 
significantly  differently  from  the  version  stored  in  the  template. 
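
The trade-off in choosing the wildcard distance can be illustrated with a deliberately simplified comparison. The sketch assumes that the wildcard and the word template are competing to explain the same stretch of input frames, so that the wildcard costs its fixed distance once per input frame; the function name and the numbers in the usage note are hypothetical.

```python
def wildcard_wins(word_frame_distances, wildcard_cost):
    """True if a constant-cost wildcard explains the frames more cheaply.

    word_frame_distances: distances along the word template's best path,
    one per input frame in the stretch being explained.
    """
    return wildcard_cost * len(word_frame_distances) < sum(word_frame_distances)
```

With a wildcard distance of 3 per frame, a well-matched word with path distances [1, 2, 1] is kept, whereas a poorly matched stretch with distances [9, 12, 10] is handed to the wildcard.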


4.3  Acoustic  Analysis 


The  acoustic  analysis  used  in  our  earlier  speech  recognition  work  was  based  on  the 
JSRU  channel  vocoder  [4],  which  was  designed  for  low-bit-rate  communications.  The 
analysis  could  therefore  be  assumed  to  give  a  compact  description  of  the  speech  signal 
while preserving at least enough information to allow speech communication. In fact the
analysis channel filter bandwidths and spacings, and the logarithmic amplitude scale,
are  all  broadly  consistent  with  the  usual  psychophysical  models  of  the  auditory  system. 

The  standard  channel  vocoder  analysis  in  our  computer  program  produces  speech 
spectra  at  the  rate  of  50  frames  per  second,  but  the  analysis  that  is  currently  used  in 
the  real-time  implementation  produces  200  frames  per  second.  This  quantity  of  data,  if 
used  directly,  would  lead  to  a  very  large  amount  of  computation  (the  computation  rate  is 
proportional  to  the  square  of  the  frame  rate),  and  to  a  large  template  store.  A 
variable  frame  rate  procedure  [5]  has  therefore  been  adopted,  which  uses  all  input 
frames  when  the  spectrum  is  changing  most  rapidly,  and  omits  a  high  proportion  of  the 
frames  when  the  spectrum  is  relatively  constant.  Thus  the  variable  frame  rate  procedure 
also has the benefit of emphasising the non-stationary portions of the speech sound
pattern,  where  there  is  probably  most  linguistic  information. 
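
The text does not give the details of the variable frame rate procedure of [5]. One simple way to obtain the behaviour described, offered purely as an assumption, is to keep a frame only when it differs from the last frame kept by more than a threshold:

```python
def variable_frame_rate(frames, threshold):
    """Keep a frame only when it differs enough from the last frame kept.

    frames: list of spectra (lists of channel values); threshold: squared
    Euclidean distance above which a frame counts as a change (hypothetical).
    """
    kept = []
    for frame in frames:
        if not kept:
            kept.append(frame)       # always keep the first frame
            continue
        change = sum((a - b) ** 2 for a, b in zip(frame, kept[-1]))
        if change > threshold:
            kept.append(frame)
    return kept
```

Steady stretches then collapse to a single frame, while every frame is retained where the spectrum moves quickly.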


4.4  Spectrum  Distance  Measure 


The  acoustic  analysis  data  is  only  used  as  the  input  to  the  calculation  of  a 
distance  or  measure  of  dissimilarity  between  two  frames.  Ideally  this  distance  should 
not  be  greatly  affected  by  the  loudness  of  the  speech,  the  background  noise  or  the 
transmission  conditions  (e.g.  telephone  line,  position  of  microphone,  etc.),  but  it 
should  be  sensitive  to  important  differences  between  the  shapes  of  the  two  spectra.  In 
practice  we  have  used  the  square  of  the  simple  Euclidean  distance,  but  we  can  also 
incorporate  additional  processing  to  make  some  adjustment  for  both  amplitude  variations 
and  background  noise. 
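
A minimal sketch of the squared Euclidean distance is given below. The optional level removal is our assumption about one possible amplitude adjustment and is not the processing used in Logos: on a logarithmic amplitude scale a change in loudness adds roughly the same amount to every channel, so subtracting each frame's mean removes it.

```python
def spectrum_distance(frame_a, frame_b, remove_level=False):
    """Squared Euclidean distance between two log-amplitude spectra."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of channels")
    if remove_level:
        # Assumed amplitude adjustment: subtract the mean level of each frame.
        mean_a = sum(frame_a) / len(frame_a)
        mean_b = sum(frame_b) / len(frame_b)
        frame_a = [a - mean_a for a in frame_a]
        frame_b = [b - mean_b for b in frame_b]
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b))
```

Two frames with the same spectral shape but different overall levels then give a distance of zero when remove_level is set, and a large distance otherwise.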

The  vocoder  analysis  has  a  limited  dynamic  range,  and  in  speech  with  a  good 
signal-to-noise  ratio  the  spectrum  in  the  gaps  between  words  is  reasonably  constant.  As 
described in Section 4.1, we use a "silence template" to account for these gaps between
words,  and  this  normally  matches  well.  In  realistic  conditions,  the  background  noise 
can  have  a  relatively  high  level  and  an  arbitrary  spectrum  shape  and  can  also  be 
varying.  If  the  background  noise  is  varying,  we  need  to  estimate  the  background  noise 
continuously  and  to  modify  the  silence  templates  accordingly.  However,  background  noise 
also  affects  the  fit  of  all  the  word  templates,  and  therefore  we  include  some  additional 
processing  which  reduces  the  effect  of  the  background  noise  on  the  distance  measure. 

Klatt  [6]  proposes  a  number  of  improvements  to  a  simple  filter  bank  analysis.  In 
Klatt's  noise  compensation  method  the  distance  computation  needs  four  spectra:  input 
speech,  input  noise,  template  speech  and  template  noise.  In  our  method  we  first  compute 
a  weighting  function  for  each  speech  spectrum,  based  only  on  that  spectrum  and  the 
current  estimate  of  the  noise  spectrum.  The  spectrum  distance  calculation  combines  the 
two speech spectra with their weighting functions in a way that makes full use of the
original information and provides a measure of the amount of difference between the
underlying  speech  spectra.  The  method  leads  to  a  particularly  efficient  hardware 
implementation  because  the  weighting  function  can  be  represented  using  very  few  bits, 
and  because  the  distance  calculation  can  be  pipelined. 


4.5  Computation  and  Storage 


The  notation  used  above  implied  that  the  scores  (C)  and  the  word  links  (L)  need 
three-dimensional  arrays  to  store  them,  but  because  of  the  order  in  which  the  processing 
is  done  there  is  no  need  for  the  input  frame  index  (j).  Consequently,  the 
implementations  store  one  score  and  one  word  link  for  each  template  frame.  Similarly, 
spectrum distances are always between a given template frame and the current input
frame.
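
One way to see why the input frame index is not needed is to update the scores of a template in place, working from its last frame back to its first, so that the values read on the right-hand side of equation (1) are still those of the previous input frame. The sketch below assumes this ordering; the function and argument names are hypothetical.

```python
def update_template_column(scores, frame_dists, entry):
    """Overwrite scores (C for input frame j-1) with C for input frame j.

    scores[i]: one score per template frame; frame_dists[i]: distance between
    template frame i and the current input frame; entry: best score at the
    end of the templates allowed to precede this one, for frame j-1.
    """
    for i in range(len(scores) - 1, -1, -1):      # last template frame first
        # scores[i], scores[i-1], scores[i-2] have not been overwritten yet
        best = min(scores[i - a] for a in (0, 1, 2) if i - a >= 0)
        if i == 0:
            best = min(best, entry)
        scores[i] = best + frame_dists[i]
    return scores
```

The entry score must be read from the template ends before any column is overwritten, since it too belongs to the previous input frame.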

Compared with an isolated word recogniser with the same size vocabulary, using the
same  methods  of  pattern  representation  and  dynamic  programming  matching  algorithm,  the 
above  connected  word  recognition  algorithm  takes  about  the  same  amount  of  computation 
per  input  frame,  although  it  needs  a  greater  amount  of  working  storage.  Because  all 
words  must  be  considered  in  parallel,  the  working  storage  for  the  scores  is  increased  by 
a  factor  of  the  vocabulary  size.  The  word  links  L  and  the  arrays  F  and  T  also  need 
extra  storage,  but  the  total  amount  of  working  storage  is  less  than  that  required  to 
hold  the  template  patterns  themselves. 

The most computationally-intensive operation is the calculation of the spectrum
distances,  but  this  is  a  very  regular  operation  which  is  well  suited  to  special, 
high-speed  circuitry.  The  spectrum  distance  calculation  needs  rather  rapid  access  to 
the  template  data. 


4.6  Score  Pruning  and  Scaling 


The  amount  of  computation  can  be  reduced  significantly  by  pruning  the  dynamic 
programming  scores  (Lowerre's  "Beam  Search"  [7]).  For  each  input  frame,  all  scores 
which  are  more  than  some  specified  distance  away  from  the  best  score  for  that  input 
frame  are  removed  from  further  consideration.  This  avoids  considering  relatively 
unlikely interpretations of that part of the input pattern which has already been
processed, but keeps the options open if there seems to be ambiguity. Our present
computer  program  and  the  real-time  equipment  also  reduce  the  range  of  numbers  needed  to 
represent the scores by subtracting the best score for each input frame from every score
in that frame, so that the best score is always zero. Fig. 6 can be regarded as a
representation of the actual, modified scores,
with  the  white  areas  corresponding  to  "pruned"  scores. 
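
A minimal sketch of the pruning and scaling step for a single input frame is given below. It assumes that scores are cumulative distances (lower is better) and that pruned entries are marked with infinity; the function name and the beam width value are hypothetical.

```python
import math

def prune_and_rescale(scores, beam_width):
    """Prune scores far from the best one, then set the best score to zero.

    scores     -- cumulative distances for one input frame; math.inf marks
                  an entry that has already been pruned
    beam_width -- entries worse than (best + beam_width) are pruned
    """
    best = min(scores)
    if math.isinf(best):
        return list(scores)              # nothing active in this frame
    result = []
    for s in scores:
        if s - best > beam_width:
            result.append(math.inf)      # removed from further consideration
        else:
            result.append(s - best)      # rescaled by subtraction
    return result

print(prune_and_rescale([12.0, 15.5, 40.0, 13.0], beam_width=10.0))
# prints [0.0, 3.5, inf, 1.0]
```

Subtracting the best score does not change which path wins, since every surviving score in the frame is shifted by the same amount.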


5.0  LOGOS  -  A  SPEECH  RECOGNITION  MACHINE 


Although  our  connected  word  recognition  algorithm  has  been  available  for  some  years 
in  a  non-real-time  computer  program,  our  experiments  have  been  limited  by  computation 
speed  and  insufficient  memory  to  store  more  than  a  dozen  templates.  To  enable  the 
pattern  matching  technique  to  be  explored  further,  a  powerful  and  flexible  real-time 
equipment,  based  on  our  algorithm,  has  been  designed  and  constructed  under  contract  by 
Logica  Ltd.  This  equipment,  known  as  "Logos",  offers  applications  research  laboratories 
the  capability  to  evaluate  the  use  of  this  type  of  speech  recognition  in  many 
applications.

The  main  aim  was  to  achieve  flexibility  and  performance,  at  the  expense  of  storage 
and  computation  costs.  An  important  requirement  was  to  allow  interaction  between  the 
recognition  process  and  the  software  which  makes  use  of  the  recognition  results.  In  one 
direction,  the  constraints  of  the  application  can  be  used  to  guide  the  recognition 
process,  by  using  syntax  and  setting  various  parameters.  In  the  other  direction,  the 
output  of  the  recogniser  can  include  much  more  than  just  the  identity  of  the  templates 
in  the  best-fitting  template  sequence.  Other  information  which  has  been  made  available 
includes  durations  and  scores  for  the  intervals  of  the  input  which  are  "explained"  by 
each template, and some indication of alternative (sub-optimal) template sequences,
which  might  be  chosen  by  the  application  software  if  they  "make  more  sense". 

A  functional  overview  of  Logos  is  shown  in  Fig.  12.  The  acoustic  analysis  section 
includes  a  19-channel  filter  bank  analyser  with  much  better  amplitude  and  time 
resolution  than  the  vocoder,  plus  a  microprocessor  (the  Front  End  Processor).  The  front 
end  processor  implements  various  transformations  of  the  raw  filter  bank  spectrum 
cross-sections,  including  variable  frame  rate,  background  noise  spectrum  estimation  and 
the  first  stages  of  the  noise  compensation  algorithm.  There  is  a  buffer  of  several 
seconds  duration  between  the  acoustic  analysis  section  and  the  pattern  matching  section, 
to  allow  for  variable  computation  rates.  The  main  computational  load  is  in  the  distance 
calculation,  which  is  handled  by  a  high-speed  special-purpose  hardware  module.  The 
spectrum  distance  includes  the  noise  compensation  technique  referred  to  in  Section  4 
above.  Up  to  16  dedicated  microprocessors  share  the  work  of  the  dynamic  programming 
steps  at  the  heart  of  the  algorithm,  while  the  control  processor  keeps  track  of  the 
syntax and word decisions. This organisation exploits the fact that the calculations
for  each  template  are  substantially  independent  of  the  calculations  for  the  others. 
Interactions  between  template  processing  only  occur  at  the  beginning  and  end  of  each 
template and in the score pruning. All the microprocessors are Intel 8086s. More
information can be found in Peckham et al. [8].

Logos  has  been  designed  to  recognise  sequences  of  connected  words  using  a 
vocabulary  of  up  to  200  templates.  The  number  of  templates  that  can  be  stored  in  the 
machine  is  only  limited  by  the  amount  of  template  memory,  and  the  length  of  the 
templates.  Finite  state  syntax  with  loops  may  be  specified  to  guide  the  recognition 
process,  which  can  cope  with  an  average  of  about  100  words  "active"  at  any  one  time, 
depending  on  the  number  of  dynamic  programming  processors.  The  wildcard  facility 
(Section 4.2) allows the equipment to deal with non-vocabulary words, and non-verbal
noises  such  as  coughs  and  breath,  at  selected  points  in  the  syntax.  The  recognition  of 
key  words  can  control  the  switching  of  syntaxes  or  choice  of  vocabulary.  Continuous 
operation  (Section  3.5)  permits  Logos  to  handle  utterances  of  any  length. 

In  common  with  most  currently  available  recognisers,  Logos  is  essentially  speaker 
dependent,  requiring  each  user  to  provide  example  utterances  of  each  of  the  vocabulary 
words. Training of the machine typically requires the user to speak each of the
vocabulary  words  at  least  once  in  isolation.  The  extraction  of  templates  is  done  using 
the  recognition  algorithm,  with  a  suitable  syntax  consisting  of  wildcard  and  silence 
templates.  By  specifying  a  more  complicated  training  syntax,  it  is  also  possible  to 
extract  templates  embedded  in  known  carrier  phrases  [9].  Logos  has  the  ability  to  store 
and  retrieve  templates  from  a  host  computer.  For  research  purposes,  the  attached  host 
may  be  placed  in  complete  control  of  Logos,  when  monitoring  of  recognition  performance 
and  acquisition  of  intermediate  results  are  possible. 


6.0  CONCLUSIONS 


The  two  main  features  of  our  current  approach  to  connected  word  recognition  are  the 
use  of  whole  word  templates  and  the  one-pass  dynamic  programming  algorithm  for  deciding 
the identity of the words spoken. It is a simple-minded, brute-force approach and its
performance  falls  far  short  of  that  of  a  human  listener,  even  when  the  machine  is  set  up 
to  suit  the  speaker.  However,  we  believe  that  machines  based  on  these  principles  will 
be  useful  in  many  voice  input  applications,  particularly  for  trained  operators  of 
complex  machines. 


These  methods  can  form  a  stepping  stone  for  future  research  in  automatic  speech 
recognition.  There  is  the  potential  to  improve  the  whole  word  pattern-matching  approach 
by  incorporating  more  information  about  the  words  in  each  template.  This  could  include 
information  about  permitted  timescale  distortion  [10],  and  about  the  variability  of  the 
spectrum  at  each  frame.  Even  for  quite  different  approaches  to  automatic  speech 
recognition,  the  one-pass  dynamic  programming  organisation  can  be  an  inspiration  as  a 
method  for  analysing  and  "decoding"  the  speech. 

The experiments which we have carried out so far using whole word pattern matching
have  been  limited,  and  only  small  vocabularies  have  so  far  been  tested.  The  power  and 
the  limitations  of  these  methods  have  yet  to  be  ascertained  and  the  real-time  equipment 
that  is  now  becoming  available  will  help  to  explore  their  potential.  Possibly  more 
important  is  the  need  to  explore  the  consequences  of  using  automatic  speech  recognition 
in  many  different  application  areas.  It  is  hoped  that  the  present  generation  of  speech 
recognition  equipments  will  be  used  to  find  suitable  ways  of  using  speech  recognition  in 
complete  systems,  so  that  when  smaller,  cheaper  and  perhaps  better  speech  recognition 
equipments  become  available  in  the  future,  they  can  be  used  profitably  without  further 
delay.

Fig. 7. Word decision tree generated during recognition.


Fig.  8.  Structure  of  Vintsyuk's  word  decision  data 
(indexed  by  input  frame  number). 


Fig. 9. Propagation of the word link information.

Fig.  10.  An  example  of  a  finite  state  syntax  with  loops. 

Fig. 11. Properties of the word decision tree which are
used in continuous operation.

Fig. 12. Functional overview of "Logos",
a real-time continuous connected
word recognition system.

This paper is designed to familiarize those involved in training development with
the  nature,  constraints  and  applications  of  computer  voice  technology  (CVT).  It  will 
also  show  you  how  to  evaluate  voice  technology  for  meeting  training  requirements,  and 
how  to  incorporate  CVT  into  your  training  design. 

Let's look at the technology - the "how-does-it-work" of computer speech generation
and voice recognition. This will not be a highly technical discussion for two
reasons. First, a technical discussion would require a highly technical background
and would not further your use of the technology. Second, the technology is continually
diversifying. It is becoming increasingly difficult to keep track of the various
agencies and vendors involved in CVT, let alone their individual approaches, techniques,
and special interests and applications. Rapid advances in language analysis
and other related technologies add to the rapid technical growth of this field.

Computer  Speech  Generation  (CSG) 

We'll look first at computer speech generation (CSG), which is a simpler
technology than voice recognition. There are, essentially, three types
of CSG - digitized, word generated, and phoneme generated (see Figure 1).

Digitized  Speech 

Sounds  are  produced  as  wave  forms.  When  a  word  is  entered  into  the  computer,  the 
wave  form  is  digitized  and  becomes  a  pattern.  This  pattern  represents  all  of  the 
stress,  pitch,  and  pause  associated  with  the  word.  When  the  complete  word  pattern  is 
stored,  the  speech  that  is  generated  from  it  is  called  digitized,  and  the  resulting 
sounds  are  very  real  and  natural.  Storing  complete  word  patterns,  however,  requires 
a  lot  of  memory. 

Synthesized  Speech  -  Word  Generated 

In  order  to  save  computer  memory,  the  digitized  pattern  can  be  compressed,  and 
then  may  be  stretched  out  again  to  be  generated.  When  this  is  done,  however,  less  of 
the  word  pattern  is  stored,  the  sound  wave  becomes  slightly  distorted,  and  the 
results sound somewhat metallic or mechanical. This speech is called "word generated"
and is one type of synthesized speech - speech which sounds somewhat synthetic
rather  than  natural. 

Synthesized  Speech  -  Phoneme  Generated 

Another type of synthesized speech is phoneme generated, which uses the least
amount of computer memory. Phonemes are the basic, or smallest, phonological elements
of speech. All of the words in most languages can be formed from combinations
of phonemes. Word patterns are formed in one of two ways: the digitized phonemes for a
word are either selected by the computer according to pronunciation rules, or
preprogrammed as a sequence of phonemes which are combined to form a word pattern. This
pattern is stretched out to real time into a word sound wave. The size of the available
vocabulary is limited only by programming requirements. Phoneme generated
speech is more distorted than word generated speech since the effect sounds have on
one another when spoken together is not accounted for. This is the speech that
sounds robotic.


*  This  paper  was  developed  from  a  portion  of  the  effort  completed  under  Government 
Contract  N61339-80-G-0003  with  Eagle  Technology,  Inc.  Opinions  expressed  herein 
are  those  of  the  authors  and  do  not  necessarily  reflect  official  policies  of  the 
United  States  Navy. 

Figure 1. Characterization of Computer Speech Generation (panels include
Synthesized Speech - Word Generator and Synthesized Speech - Phoneme Generator).

CSG  Characteristics 

By looking at these different sound waves, you can see the quality difference
computer memory can make. Phoneme generated speech can be improved through the use
of filters and other technical advances which smooth out the sound wave. This speech
is the most flexible, offering an unlimited vocabulary at low cost. Phoneme generated
speech also can provide multilingual vocabularies.

Word  generated  speech  offers  a  limited  vocabulary,  depending  on  the  available 
computer  memory  and  the  amount  of  compression  used.  The  more  compressed  the  word 
patterns are, the lower the speech quality. Word generators use up to five times the
computer memory of phoneme generators and are equivalent in cost to phoneme generators,
as they trade off vocabulary size for speech quality.

Digitized speech is much more sophisticated than synthesized speech and uses at least
twice as much memory as the highest quality word generated speech. Digitized speech
can use as much as ten thousand bits of storage per second of speech, where
phoneme-generated speech can use as little as four hundred bits per second. Digitized
speech requires a large computer capability, resulting in high quality, natural sounding
speech and a large vocabulary at much higher costs.
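
To get a feel for what these rates mean, here is a small worked calculation. The two bit rates are the figures quoted above; the vocabulary of 200 words averaging half a second each is a made-up example, not a figure from any particular system.

```python
DIGITIZED_BITS_PER_SECOND = 10_000   # upper figure quoted in the text
PHONEME_BITS_PER_SECOND = 400        # lower figure quoted in the text

def storage_bytes(seconds_of_speech, bits_per_second):
    """Storage needed for a given amount of speech, in 8-bit bytes."""
    return seconds_of_speech * bits_per_second / 8

# Hypothetical vocabulary: 200 words averaging 0.5 seconds each.
total_seconds = 200 * 0.5

print(storage_bytes(total_seconds, DIGITIZED_BITS_PER_SECOND))  # prints 125000.0
print(storage_bytes(total_seconds, PHONEME_BITS_PER_SECOND))    # prints 5000.0
```

Under these assumptions the digitized vocabulary needs 25 times the storage of the phoneme generated one, which is the kind of trade-off to weigh against speech quality.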

A  choice  of  which  speech  generation  method  to  employ  depends  upon  your  specific 
application and involves certain trade-offs between vocabulary size, speech clarity
and  cost  in  terms  of  computing  power. 

COMPUTER  VOICE  RECOGNITION  (CVR) 


Now  let's  discuss  computer  voice  recognition  (CVR).  A  computer  can  be  controlled 
by  verbal  commands  through  the  addition  of  a  voice  recognition  capability.  This  type 
of  system  consists,  basically,  of  a  recognition  capability,  a  computer,  necessary 
interfaces, appropriate software, and the programs to incorporate the recognizer.
Different  manufacturers  produce  these  systems  using  different  technologies.  Some 
build  computers  with  the  voice  component  built  in,  and  some  produce  voice  units  which 
can be interfaced with certain computers. Currently, research is ongoing to produce
a voice recognizer on an electronic component or "chip."

CVR Process


Generally, computer voice recognition works in this manner. As in speech generation,
sounds are digitized and made into patterns by the computer. The computer has
a set of patterns stored in its memory, which it compares to the incoming voice
patterns. If the computer can find a close enough match according to preset probability
parameters, it will select the matching item. If not, it will not recognize
the pattern. In most systems the parameters can be changed by the user. How to
change them is a part of the design issue which is covered in more detail later.
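
The match-or-reject behavior just described might be sketched as follows. This is only an illustration: it assumes each pattern is a short list of numbers of equal length, and it uses a simple distance threshold (`max_distance`, a hypothetical name) in place of the probability parameters a real recognizer would use.

```python
def recognize(incoming, stored_patterns, max_distance):
    """Return the best matching word, or None if nothing is close enough.

    incoming        -- the incoming voice pattern (list of numbers)
    stored_patterns -- dict mapping each word to its stored pattern
    max_distance    -- user-adjustable rejection threshold (an assumption)
    """
    best_word, best_distance = None, float("inf")
    for word, pattern in stored_patterns.items():
        # Euclidean distance between the incoming and stored patterns.
        distance = sum((a - b) ** 2 for a, b in zip(incoming, pattern)) ** 0.5
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word if best_distance <= max_distance else None

patterns = {"left": [1.0, 0.0, 0.0], "right": [0.0, 1.0, 0.0]}
print(recognize([0.9, 0.1, 0.0], patterns, max_distance=0.5))  # close to "left"
print(recognize([0.5, 0.5, 0.9], patterns, max_distance=0.5))  # nothing close enough
```

Raising the threshold makes the system reject fewer utterances but increases the chance of selecting the wrong item, which is why the setting matters in training design.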

The comparison process is referred to as pattern matching.1 Most voice recognizers
currently employ pattern matching on the whole word, which requires storage of
whole word patterns and the matching of the pattern of each incoming whole word within
some probability parameter. Some research topics in pattern matching are as
follows:

•  Word  spotting,  where  attention  is  focused  on  the  portions  of  speech  that 
distinguish  words. 

•  Matching  short-time  spectra,  which  matches  the  sequence  of  small  units  with 
similar  stored  units. 

•  Matching  spoken  phonetic  sequences  with  stored  phonetic  sequences  for  all 
possible  words. 

On some occasions the computer may misrecognize a pattern and respond accordingly.
Words which sound alike can cause this problem, for example "run" and "one." If
one of these words cannot be eliminated from the program, then a different spoken
word can be substituted for one of them. In this case "run" can be
entered as "execute." When establishing a word set, the vocabulary should be checked
for words that can easily be confused. These words should be replaced, if possible.

On other occasions the computer may not recognize a word at all, usually because of
voice changes which result from stress, fatigue or illness. In this situation additional
voice patterns need to be entered into the computer.

Training  The  Computer 

Entering voice patterns into the computer is referred to as "training the computer,"
or voice data collection and enrollment, and consists of inputting
repeated entries of the words which are to be recognized. The computer averages the
entered patterns into a representative pattern for each word. When non-recognition or
misrecognition occurs, the computer must be "retrained," and the additional inputs
are averaged into the existing pattern. The type of training required by a system
depends on whether the system is speaker dependent or speaker independent.
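
One simple way to picture the averaging is as a running mean that is updated each time a word is entered. The sketch below assumes fixed-length numeric patterns; the class and method names are hypothetical, and commercial systems may well average in other ways.

```python
class WordTemplate:
    """Representative pattern for one word, kept as a running average."""

    def __init__(self, length):
        self.mean = [0.0] * length
        self.count = 0

    def add_entry(self, pattern):
        """Average one more spoken entry into the stored pattern."""
        if len(pattern) != len(self.mean):
            raise ValueError("pattern length does not match the template")
        self.count += 1
        for i, value in enumerate(pattern):
            # Incremental mean: move the stored value toward the new entry.
            self.mean[i] += (value - self.mean[i]) / self.count

template = WordTemplate(length=2)
template.add_entry([1.0, 3.0])      # initial training
template.add_entry([3.0, 5.0])
print(template.mean)                # prints [2.0, 4.0]
template.add_entry([5.0, 7.0])      # later retraining after a misrecognition
print(template.mean)                # prints [3.0, 5.0]
```

Retraining uses exactly the same operation as initial training, so each extra entry shifts the stored pattern a little toward the speaker's current voice.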

Speaker  Dependence/Independence 

Speaker  dependent  systems  are  specifically  tailored  to  an  individual's  speech 
patterns.  Each  speaker  must  train  the  system  to  recognize  the  speaker's  voice  by 
repeating each word in the vocabulary from one to ten times, depending on the requirements
of the particular system. This process results in a system that accepts
variations in pronunciation. If the computer has difficulty with non-recognition or
misrecognition of any word, the speaker can usually correct this with a brief retraining
of the computer on that word. Non-recognition or misrecognition can result from voice changes due to a
cold,  the  time  of  day,  or  other  circumstances. 

Speaker  dependent  systems  are  primarily  useful  when  the  speaker  will  use  the 
system  to  repetitively  perform  a  task.2  This  type  of  system  is  also  useful  when 
the  task  requires  that  only  a  properly  qualified  person  perform  that  task,  such  as  in 
quality  assurance,  production  cost  accounting,  or  electronic  funds  transfer.  Many 
systems,  however,  have  the  capability  of  storing  several  speakers'  sets  of  speech 
patterns,  so  that  any  individual  speaker  can  call  up  his  or  her  patterns  into  the 
computer  and  use  the  system  with  little  or  no  retraining.  Different  speakers' 
patterns  can  also  be  stored  on  disc  or  tape  and  loaded  as  needed.  This  is  useful 
when  a  system  is  being  used  to  train  persons  in  a  task,  or  when  several  persons 
perform a job at different times, such as in shift work.

Speaker independent or universal systems are designed to recognize the voices of
persons for whom no previous voice samples have been supplied. In experimental
systems, this function is accomplished by constructing a dictionary of reference
patterns that model the peculiar speech characteristics of each word. Commercial
systems incorporate speaker independent recognition by constructing a very large data
base from hundreds of speakers in order to appropriately model all of those with
diverse patterns. In sophisticated systems of this type, the patterns may be adapted
automatically for better recognition performance as a new speaker continues to use
the system.3 These systems generally are less accurate, operate with smaller
vocabularies than speaker dependent systems and, in practice, may not work for all
speakers. While universal recognition is a desirable goal for speech recognition,
the loss of accuracy, the additional complexity of the recognition logic, and the
fact that not all people will be understood generally limit its use to only those
systems which are designed for public use. In those situations it is not reasonable
to expect users to train the system to recognize their individual voices.4 Depending
on the use of the system, however, it is possible for a speaker who is having
recognition difficulty to enter a word pattern or two, which are then averaged into
the data base for those words.

Speaker  independent  systems  are  required  when  the  operator  is  a  casual  user,  and 
there  is  no  requirement  to  validate  that  the  operator  is  authorized  to  perform  a 
task, such as finding a telephone number, or the flight weather between two locations.5
Speaker independent systems can also be used in conjunction with live
operators  for  such  things  as  phone  banking  where  the  live  operator  may  first  need  to 
verify  the  user. 

Isolated Word/Connected Word Recognition

Most systems available today recognize isolated word speech, which requires a
distinct pause by the speaker between each word or utterance and no substantial
pauses within words. An utterance is defined as a word or sequence of words restricted
in time to between one and one-half and three and one-half seconds, depending on
the system. Isolated word recognition systems are useful when a large number of
people must have access to the system, and when a single response can be given to a
computer question or prompt. For other purposes, however, this speech is awkward and
unnatural; connected speech may be a more viable alternative.

Connected word recognition systems can recognize a sequence of words spoken in
natural cadence and provide for faster data entry. Connected speech has such applications
as warehouse routing and inventory control, and zip code entry for mail sorting
systems. These applications are restricted, however, and recognition of unrestricted
or continuous speech is still experimental. Continuous speech recognition
is discussed later within the topic of speech understanding.

The speed of speech is more of a factor in connected speech than in isolated word
recognition. Usually, the length of the sound pattern is averaged over repeated
entries. However, another method, called dynamic programming, adjusts the time bases
of stored sound patterns to those of an unknown utterance, resulting in higher recognition
accuracy.
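
The idea of adjusting time bases can be shown with a bare-bones version of dynamic time alignment. This is a textbook-style sketch, not the method of any particular recognizer: each frame of a pattern is reduced to a single number here, whereas real systems compare spectral measurements.

```python
def alignment_distance(stored, unknown):
    """Smallest total frame difference over all allowed time alignments."""
    n, m = len(stored), len(unknown)
    inf = float("inf")
    # cost[i][j] = best total for aligning the first i stored frames
    # with the first j frames of the unknown utterance.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            difference = abs(stored[i - 1] - unknown[j - 1])
            cost[i][j] = difference + min(
                cost[i - 1][j],      # stored pattern moves on, unknown frame repeats
                cost[i][j - 1],      # unknown moves on, stored frame repeats
                cost[i - 1][j - 1],  # both move on together
            )
    return cost[n][m]

stored_word = [1, 2, 3]
spoken_slowly = [1, 1, 2, 2, 3, 3]   # same word at half the speed
print(alignment_distance(stored_word, spoken_slowly))
print(alignment_distance(stored_word, [3, 2, 1]))
```

The slowly spoken version aligns perfectly with the stored pattern, while a genuinely different sequence still produces a large distance. That is why this kind of matching tolerates changes in speaking rate.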

At  this  time  connected  speech  recognition  requires  ten  times  as  much  computation 
as  isolated  word  recognition,  and  consequently,  has  higher  costs.  Further  research 
can  bring  these  costs  down. 

Vocabulary Size

A third primary characteristic of voice recognizers is vocabulary size. What
constitutes small or large vocabularies will vary depending upon the context and the
state-of-the-art. For example, an isolated word, speaker dependent recognizer with
an utterance capacity of three hundred words or more may be considered large. However,
the vocabulary of a connected speech, speaker independent recognizer may be
considered large if it is greater than fifty words.6 Connected word vocabularies
can contain as many as one hundred and twenty words.

The total vocabulary size for isolated word recognizers usually ranges from
twelve to one thousand or more words, although at each word position there are typically
ten or fewer words in the active vocabulary set from which the correct word
must be chosen. For each task a syntax tree effectively determines which subsets of
the entire vocabulary are active or available at different stages within the task
procedure sequence according to a "menu" format (see Figure 2). As the number of
possible word choices increases, the computer's ability to discriminate between the
correct word and the incorrect words decreases.7

Vocabularies for connected speech recognizers are limited due to the added complexity
of sorting words out of sequences. This requires more complex acoustic discriminators.
Some connected speech recognizers use sentence patterns where the
sentence types are pre-programmed as in the IF-THEN statement, reducing the number of
options for any position in the sentence (see Figure 3).

Vocabularies may be expanded to an almost unlimited size through the use of additional
data storage and branching techniques. For example (see Figure 4), in a
recognizer with only a five word recognition capacity, any one or all items may be
designated as "control words." The use of a control word would cause the system to
branch to another program that contains a different set of five words, any of which
may again be control words.8 The use of any of these techniques depends specifically
on the training requirement.
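
The branching idea can be sketched with a table of five word sets in which some words are control words. All of the set names and words below are invented for illustration; only the five word limit and the branching behavior come from the description above.

```python
# Each set holds five words. A word mapped to a set name is a control word
# that branches to that set; a word mapped to None is an ordinary word.
WORD_SETS = {
    "main":  {"radio": "radio", "fuel": "fuel", "report": None, "yes": None, "no": None},
    "radio": {"tower": None, "ground": None, "approach": None, "guard": None, "main": "main"},
    "fuel":  {"total": None, "left": None, "right": None, "reserve": None, "main": "main"},
}

def hear(current_set, word):
    """Return the word set that is active after this word is recognized."""
    words = WORD_SETS[current_set]
    if word not in words:
        raise ValueError(f"'{word}' is not in the active set '{current_set}'")
    target = words[word]
    return target if target is not None else current_set

def reachable_vocabulary(start="main"):
    """All distinct words that can be reached by branching from the start set."""
    seen_sets, words, pending = set(), set(), [start]
    while pending:
        name = pending.pop()
        if name in seen_sets:
            continue
        seen_sets.add(name)
        for word, target in WORD_SETS[name].items():
            words.add(word)
            if target is not None:
                pending.append(target)
    return words

print(hear("main", "radio"))
print(len(reachable_vocabulary()))
```

Here a recognizer that only ever compares against five patterns at a time still gives the user a working vocabulary of fourteen words, and more sets could be added without changing the recognizer itself.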

Speech  Understanding 

One step beyond voice recognition is speech understanding, which requires that
only enough of the key words in an utterance be correctly decoded for an appropriate
response to be elicited from the computer, such as the retrieval of some stored
information.9 A further step in speech understanding would enable the system to
employ knowledge of the real world and of human speech to understand speaker intent.
This type of system requires significant advances in computer technology, natural
language processing, and artificial intelligence. Artificial intelligence has been
defined as "that part of computer science that is concerned with the symbol manipulation
processes that produce intelligent action."10


Figure 2. Syntax Tree for Phrase "Turn Left Heading One Zero Six Degrees."

Figure 3. Pattern Format for Sentence "If Bogey Is Less than Four Miles,
Then Arm Missiles and Advise."

Figure 4. Branches on Control Words to Five Word Vocabularies.

TRAINING  APPLICATIONS 


Four military applications for training systems which incorporate computer voice
technology will be addressed here. In the first three systems, the student would
be trained in some form of air traffic control. The student must learn to process
visual and auditory information rapidly while making verbal advisories to the aircraft
involved. Two of these systems, PARTS and LSOTS, provide control instructions
for aircraft during landing operations. The AIC system provides control during tactical
maneuvers. With the fourth system, AIDS, the student would be trained in aircraft
operations where both the aircraft and the trainer would incorporate computer
voice technology.11

PARTS 


The  Precision  Approach  Radar  Training  System  (PARTS)  for  air  traffic  controllers 
represents the culmination of work begun in 1972 for the Naval Training Equipment
Center.  This  system  was  originally  called  the  Ground  Controlled  Approach  Controller 
Training  System  (GCA-CTS). 

The earliest work on this system identified the PAR control task as an ideal test
bed for research in computer voice technology. PAR control is primarily a verbal
task not previously amenable to automated training. The vocabulary used is rigidly
defined and highly stylized and is, therefore, potentially recognizable by the isolated
phrase recognition technology. Performance of the PAR control task requires
interaction with a pilot, pattern controller, and tower controller (see Figure 5).
This situation is also ideally suited to the development of models for these positions
incorporating speech generation.

A series of laboratory studies involving the development of a preliminary training
system led to the development of an experimental prototype system. This system
was used to demonstrate the feasibility of:

• Employing the automated speech technologies in an operational training
environment.

• Developing a training methodology incorporating instructorless training and
automated speech technologies without compromising training effectiveness.

• Developing an instructor model which could provide automated adaptive training
for a primarily verbal task.

• Devising a performance measurement scheme which would enable the system to
provide instructive feedback to the trainee, progress information to the
learning supervisor, and input to the instructor model which would enable
automated adaptive problem selection.

• Devising techniques for providing the feedback to the trainee and learning
supervisor.

• Developing useful models of the verbal and motor behavior of the other
persons with whom the precision approach radar (PAR) controller interacts,
namely the pilot, pattern controller, and tower controller, as well as a
model of PAR controller behavior.

Figure 5. PARTS System Components.

The primary performance requirements under which PARTS was developed include:

• The state of the art in speaker dependent, isolated phrase voice recognition
technology.

• Good voice recognition in real time over a relatively large vocabulary containing
many similar phrases.

• System visibility by persons not previously trained in the use of voice
recognition equipment.

• Training in the PAR control task equivalent to that provided in the existing
training environment and in an environment with minimum instructor intervention.

• Realistic stimuli such as radar displays, servo controls, and communications
equipment to facilitate transfer of training.

• Training of the student to proficiency within present time constraints.

The resulting PARTS is a stand-alone, experimental, prototype training system.
PARTS provides automated, individualized instruction in techniques for providing
ground-controlled approaches. In addition, it provides a realistic environment in
which radar control skills can be practiced under the supervision of an automated
instructor. It also provides objective performance measurement and feedback in the
form of performance summaries and annotated replays. Although the order of topic
presentation is rigidly defined in the basic syllabus, problem difficulty is adapted,
amount of practice is varied, and remedial exercises are selected to automatically
adapt the basic course to the needs of the individual trainee. One of the major
benefits of the system is that it relieves the trainee of the need to devote part of
his or her time to serving as a pseudo pilot for other trainees, a requirement when
using the existing training device. It also provides enrichment topics for those
students who complete the basic course quickly, giving students who quickly
attain the minimum requirements to qualify as air traffic controllers the
opportunity to continue with advanced training topics. Finally, the system provides
the learning supervisor with informative feedback regarding the individual trainee's
performance.^ An evaluation of PARTS is reported by McCauley.13
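
The adaptive behavior described above (fixed topic order, with difficulty, amount
of practice, and remedial work adjusted per trainee) can be illustrated with a small
sketch. This is not the PARTS instructor model, whose internals are not given here;
the score thresholds, the difficulty range, and all names below are assumptions
chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the paper gives no numeric criteria.
ADVANCE_SCORE = 0.85   # at or above this score, difficulty is raised
REMEDIAL_SCORE = 0.60  # below this score, a remedial exercise is selected
MIN_LEVEL, MAX_LEVEL = 1, 5


@dataclass
class TraineeState:
    topic_index: int = 0  # position in the rigidly ordered syllabus
    difficulty: int = 1   # current problem difficulty within the topic


def next_exercise(state: TraineeState, last_score: float) -> str:
    """Choose the next exercise type and update the trainee state in place."""
    if last_score < REMEDIAL_SCORE:
        # Poor performance: ease the difficulty and assign remedial work.
        state.difficulty = max(MIN_LEVEL, state.difficulty - 1)
        return "remedial"
    if last_score >= ADVANCE_SCORE:
        if state.difficulty == MAX_LEVEL:
            # Topic mastered: move on in the fixed syllabus order.
            state.topic_index += 1
            state.difficulty = MIN_LEVEL
            return "next_topic"
        state.difficulty += 1
        return "harder_problem"
    # Middling performance: vary the amount of practice at the same level.
    return "more_practice"
```

The topic order never changes in this sketch; only the difficulty and the kind of
exercise respond to the trainee's measured performance.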

AIC System

Research to develop a voice recognition and speech understanding system was initiated
in 1977 to support an automated training system for Air Intercept Controllers
(AICs). The task of the AIC is to direct an intercept aircraft in a combat situation
to destroy an enemy aircraft (see Figure 6). The AIC must make split-second decisions
and have rapid, accurate motor responses in controlling the aircraft displayed
on a complex monitor. The research involved developing a voice recognition subsystem
utilizing new recognition techniques. This system was subsequently tested in a
laboratory AIC training model which was under concurrent development.


Figure  6.  AIC  Work  Station. 


The  objectives  of  the  research  were  to: 

•  Achieve  a  greater  understanding  of  the  demands  placed  on  computer-based 
voice  recognition  by  an  automated  AIC  training  system. 

• Determine if a previously demonstrated limited connected voice recognition
system could be effectively combined with the more usual isolated word
recognition systems to satisfy the AIC requirements.

• Provide an applications environment for the continued development of speech
recognition algorithms to provide focus and ensure the earliest possible
realization of an operationally useful recognition capability.

The primary constraints under which the AIC training system was developed include:

•  The  innovative  use  of  new  technologies.  The  automated  AIC  training  problem 
represents  a  significant  advance  beyond  the  PARTS  in  both  the  application  of 
the  voice  technologies  as  well  as  training  systems  design. 

• A requirement for recognizing a large number of numeric strings. Isolated
phrase recognition will not meet the recognition requirement of AIC training,
and a complete connected speech capability is beyond the current technology.
Therefore, a mixed isolated phrase and limited connected recognizer
was required.
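
A sketch may help show what a mixed isolated-phrase and limited connected vocabulary
looks like at the word level. It assumes the acoustic recognizer has already produced
a sequence of word tokens; the phrases, the three-digit limit, and the function name
are hypothetical and are not taken from the AIC system.

```python
# Spoken digits that may be run together (the limited connected part).
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "niner": "9",
}

# Hypothetical isolated phrases; each may be followed by a short digit string.
PHRASES = {("port",), ("starboard",), ("bogey", "bearing"), ("range",)}
MAX_PHRASE_WORDS = max(len(p) for p in PHRASES)
MAX_DIGITS = 3  # assumed limit on the connected numeric string


def parse(tokens):
    """Split word tokens into (phrase, digit_string) pairs.

    Raises ValueError on any word outside the constrained vocabulary.
    """
    result = []
    i = 0
    while i < len(tokens):
        # Longest-match lookup of an isolated phrase.
        match = None
        for n in range(MAX_PHRASE_WORDS, 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in PHRASES:
                match = candidate
                break
        if match is None:
            raise ValueError(f"unrecognized word: {tokens[i]!r}")
        i += len(match)
        # Collect a limited run of connected digits, if any follow.
        digits = ""
        while i < len(tokens) and tokens[i] in DIGITS and len(digits) < MAX_DIGITS:
            digits += DIGITS[tokens[i]]
            i += 1
        result.append((" ".join(match), digits))
    return result
```

For instance, the token sequence for "bogey bearing two seven zero range four five"
would be split into a bearing phrase with the string 270 and a range phrase with the
string 45, while any word outside the vocabulary is rejected.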

The  prototype  was  a  stand-alone  system  which  was  used  for  fleet  evaluation  for 
further  revision  and  determination  of  logistic  requirements.^  An  evaluation  of 
the  AIC  is  reported  by  McCauley. 

LSOTS 

A Part Task Landing Signal Officer (LSO) Waving Concept Trainer was an exploration
in training the conceptual as opposed to the perceptual portion of the LSO's
task of controlling aircraft landings aboard a carrier (see Figure 7). The LSO must
visually evaluate an aircraft's position on its approach to landing. The LSO
assesses aircraft approach and recovery conditions, directs pilot corrections, and
advises superiors of recovery feasibility, efficiency, and safety. The LSO is
responsible for guiding the aircraft to a safe landing or waving it off for another
attempt. The LSO uses a handset to communicate verbally with the pilot and a hand
device for activating "wave off" and "cut" lights, requiring the use of both hands.
This is a critical task requiring fine discriminations, quick decisions, and
eye-mouth coordination. The part-task system teaches the latter two functions.

Figure 7. Characterization of the LSO Part Task Waving Concept Trainer.


LSO training incorporates a large percentage of on-the-job training (OJT), and
is, consequently, dependent on OJT opportunities. Reduced operational training
opportunities have created a severe manpower shortage of LSOs. A review of the
required skills determined that the training of LSOs could be accomplished using
automated training. The proposed LSO Training System (LSOTS), based on the
part-task system plus a high fidelity visual system, visually simulates the view of
the aircraft from the LSO position on the aircraft carrier (see Figure 8). The aircraft
moves in response to the student's verbal commands. The system uses computer
generated imagery to present the scenario, a voice input, and a keyboard CRT to
present instruction and performance evaluation information.

AIDS 


The  Advanced  Integrated  Display  System  (AIDS)  is  a  test  bed  cockpit  trainer  for 
the  F-18  aircraft,  presenting  an  example  of  the  use  of  voice  recognition  and  speech 
generation  in  an  operational  environment  which  requires  the  same  use  in  the  trainer. 
AIDS  uses  voice  recognition  for  control  of  radio  frequencies,  TACAN  and  SIF,  and 
speech  generation  for  verification  of  commands.  For  example,  the  pilot  can  vocally 
change  the  radio  frequency,  and  the  computer  will  vocally  verify  the  new  frequency. 
This system holds potential for gaining information from the computer. The pilot
could ask the computer to alert him when the altitude is below five hundred feet.
The computer would verify the input by repeating it. Then when the aircraft descends
below  five  hundred  feet,  the  voice  generator  would  notify  the  pilot.  A  more  complex 
example  has  the  pilot  telling  the  computer  to  arm  the  missiles  when  the  bogey  is 
within  ten  miles,  and  to  report  accomplishment  to  the  pilot.  The  computer  would 
verify  the  input,  track  the  bogey  to  within  ten  miles,  arm  the  missiles,  and  notify 
the  pilot. 
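
The verify-then-monitor pattern in these examples can be sketched as follows. This
is only an illustration of the interaction described above, not the AIDS software:
the class names, the state dictionary key `altitude_ft`, and the use of a list in
place of a real speech generator are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Rule:
    description: str                                # read back for verification
    condition: Callable[[Dict[str, float]], bool]   # tested against aircraft state
    message: str                                    # spoken when the condition is met
    fired: bool = False


@dataclass
class VoiceMonitor:
    rules: List[Rule] = field(default_factory=list)
    spoken: List[str] = field(default_factory=list)  # stands in for speech output

    def say(self, text: str) -> None:
        self.spoken.append(text)

    def add_rule(self, rule: Rule) -> None:
        # The computer verifies the pilot's input by repeating it.
        self.say(f"Confirm: {rule.description}")
        self.rules.append(rule)

    def update(self, state: Dict[str, float]) -> None:
        # Called whenever new aircraft state is available; each rule fires once.
        for rule in self.rules:
            if not rule.fired and rule.condition(state):
                rule.fired = True
                self.say(rule.message)
```

A rule for the altitude example would carry the condition
`lambda s: s["altitude_ft"] < 500`; the missile-arming example would add an action
step before the notification, which this sketch omits.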

Computer  voice  technology  is  perceived  as  having  a  high  potential  for  military 
training  applications.  This  is  particularly  true  in  the  training  of  interactive 
skills  for  teams  where  the  computer  can  model  team  members  and  provide  performance 
measurement.  Improved  procedures  for  team  task  analyses  have  made  it  possible  to 
model  the  team,  thus  providing  a  base  for  adaptive  training  techniques.^ 

Figure 8. Characterization of proposed LSO Trainer System


Another research project has explored the use of voice technology as the instructor's
assistant. The overload on the instructor due to complex simulator based
training generated the requirement for computer voice technology (CVT) to aid the
instructor. Potential results include:

•  Decrease  in  Instructor  Requirements 

•  Decrease  in  Instructor  Workload 

•  Consistency  in  Training 

• Replay/Critique Capabilities

• Automated Interaction Among Instructor/Trainee/Simulator

CVT  can  also  reduce  manpower  requirements  in  training  by  eliminating  assistants 
and  providing  a  wider  range  of  applications  in  training. 

CVT  IN  TRAINING 

Recent retention rate drops created shortages which negatively affect the training
environment in terms of a lack of skilled personnel to perform as instructors and
in interactive and team training. Worsening economic conditions which increase first
time enlistments put an additional load on short staffed schools. Also, fleet shortages
are increased due to slow and/or inadequate training. Additionally, Navy training
and its related training devices are becoming increasingly complex and require
increasingly sophisticated skills. Computer voice technology, in conjunction with
other technologies in automated training and training device situations, can aid in
the increase of training efficiency and can counteract training personnel shortages
in these ways:

•  simultaneous  training 

•  personnel  replacement 

• instructor models

•  interactive  and  team  models 

•  trainee  behavior  models 

• instructor assistance/support

•  fidelity  to  actual  systems 

Instructor, interactive and trainee models are provided by both the PARTS and
LSOTS using voice generation and recognition. PARTS provides a realistic environment
under the supervision of an automated instructor. It also provides interactive
modeling by providing the trainee air traffic controller with pilot, pattern
controller and tower controller models, thus reducing manpower requirements by relieving
other personnel from performing these roles. Another result could be a decrease in
training time as, often, trainees have been required to perform these roles for one
another. Training can also be improved since the computer can provide far more flexible
and accurate models than can trainees. In addition, PARTS includes a model of
trainee behavior which provides quick and accurate performance measurement, and
remedial and enrichment instruction as needed. Computer voice technology also provides
fidelity in trainee data input and interactive personnel.

The LSOTS uses CVT to provide two-way communication between the pilot model and
the LSO trainee. The LSOTS instructor model could provide safe practice of critical
situations with automated voice critique from the computer. LSO training has been
primarily an OJT requirement; the LSOTS can reduce training time and personnel by
greatly reducing the OJT requirement and could increase training safety.

The use of CVT as an instructor's assistant would reduce instructor busywork and
increase instructor training effectiveness by facilitating man-machine interaction by
both instructor and trainee. Presently, instructors often fail to utilize the full
potential of a training device for a number of reasons, including:

• Training devices are becoming more and more complex.

• The instructor turnover rate is high, and the training required for an
instructor to become fully aware of a device's idiosyncrasies is often not
provided for replacement instructors nor continued past the initial training
and acceptance period.

• Instructor busywork chores (note taking, grading, and equipment monitoring)
can create a high workload so that not all tasks are completed efficiently.

• The number of switches, lights, and displays often requires an assistant
operator for trainer utilization.

Computer voice technology can provide an effective man-machine communication
channel. Coupled with a "resident instructor" in the form of a computer model, the
human instructor would have access to a wealth of "knowledge" and assistance by
having the capability to talk or converse with his training system, as though it
were an assistant. Automation can provide more efficient use of the device's potential
and of the instructor's time, and thereby increase training standardization.

Training payoff with this system results from increased training proficiency due
to increased training device utilization and instructor effectiveness. Further, a
reduction in the number of instructors can be realized from the capability of the
trainee to interact directly with the instructor's assistant via computer voice
technology. Thus, fewer human instructors can handle more trainees since many of the
chores could be handled by the automated assistant. The trainee can make direct
requests to the system for information, advice, or assistance via the voice channel,
while the human instructor is free to concentrate on those trainees requiring more
detailed assistance.

The  use  of  CVT  in  an  operational  system,  such  as  the  projected  application  in  the 
F-18  aircraft,  will  result  in  a  requirement  for  CVT  in  the  training  device  to  provide 
fidelity  to  the  actual  system  for  F-18  pilot  training. 

There  are  two  conditions  which  facilitate  the  application  of  CVT  to  military 
training.  First,  a  high  percentage  of  training  devices  are  computerized,  a  necessary 
condition  for  CVT.  Second,  many  non-training  device  applications  would  be  trained 
with  computer-based  instruction.  In  this  case  the  use  of  CVT  would  not  be  driven  by 
a  task  speech  requirement  but  by  a  need  for  a  data  entry  method  other  than  keyboard 
or  light  pen.  This  could  be  due  to  environmental  effects  or  to  eliminate  interfer¬ 
ence  with  the  program.  Another  non-training  device  use  of  CVT  occurs  when  the 
required  information  cannot  be  presented  by  print  due  to  low  reading  levels  or  envir¬ 
onmental  conditions,  such  as  weather,  poor  light,  or  motion,  which  eliminate  other 
presentation  methods. 

Advantages  Of  CVT 

The advantages of computer voice technology are broad-ranging and can provide
fidelity and flexibility to computer use. CVT can model instructors, reducing
instructor workloads and personnel requirements. CVT can provide more realistic
training environments and increase student interest by providing models of required
personnel such as team members. Data indicate that the prototype systems evaluated
to date train no worse than traditional training. Voice technology has been shown to
be particularly advantageous to industry when one or more of the following conditions
apply:

•  The  worker's  hands  are  busy. 

•  Mobility  is  required  during  the  data  entry  process. 

•  The  worker's  eyes  must  remain  fixed  upon  a  display,  an  optical  instrument, 
or  some  object  to  be  tracked. 

•  The  environment  is  too  harsh  to  allow  use  of  a  keyboard. 

For  example,  a  pilot  needs  his  or  her  hands  and  eyes  for  actual  flight.  Voice 
technology  can  free  the  pilot  from  manual  data  entry  and  minimize  the  number  of 
gauges  to  be  viewed.  In  other  applications,  persons  can  directly  access  computer 
data  by  phone  with  CVT.  Voice  recognition  can  also  improve  the  speed  and  accuracy  of 
entering  data  into  a  computer.  Many  available  voice  components  are  relatively  easy 
to  interface  with  recommended  computer  hardware  and  software.  Programming  for  voice 
recognition can usually be done in standard simple languages, such as Fortran and
BASIC.

Disadvantages  Of  CVT 


The  primary  disadvantages  of  CVT  are,  at  this  time,  cost  and  user  acceptance. 
Other  disadvantages  are  more  accurately  classified  as  subjects  of  current  research, 
as  discussed  below: 

• Speaker dependency is a major recognition disadvantage. Research in phonology
and acoustics is being conducted to reduce or eliminate speaker dependency
in a cost-effective manner.

• Language constraints such as non- and misrecognition negatively affect the
user and his performance. Research geared towards establishing larger, more
flexible and easily recognizable vocabularies will reduce language constraints.


•  The  software  needed  to  operate  voice  systems  must  be  developed  for  each 
application.  Expanded  use  of  CVT  should  provide  off-the-shelf  software. 
The  integration  of  voice  capabilities  into  existing  and  planned  operational 
and  training  systems  presents  interface  complexities  and  requires  refinement. 

•  Another  disadvantage  is  the  need  to  train  and  retrain  the  computer  to  accept 
individual  voice  patterns.  Research  to  individualize  for  each  talker  a  set 
of  general  patterns  could  speed  this  process. 

•  There  may  also  be  a  requirement  to  train  the  individual  to  somewhat  unnatu¬ 
ral  speech  patterns.  Research  in  different  methods  of  identifying  utter¬ 
ances  can  eliminate  isolated  word  recognizers  and  the  need  for  unnatural 
speech. 

•  Currently,  there  is  no  validated  'cookbook'  for  resolving  the  specific  man- 
machine  interface  requirements.  The  Naval  Training  Equipment  Center  has, 
however,  sponsored  the  development  of  an  engineering  guide. I® 

Other  Technology  and  CVT 

Research in the CVT field can include, and in many cases has included, interfacing
computer  voice  technology  with  other  technologies  such  as  video  disc,  large  screen 
displays,  video  gaming  and  artificial  intelligence.  Students  can  have  voice  response 
to,  and  control  over,  video  disc  systems  during  instruction.  This  mode  of  response 
entry  is  preferable  to  keyboard  or  light  pen,  especially  when  the  student’s  hands  are 
involved  in  other  tasks. 

Voice  recognition  can  be  used  in  instructional  situations  to  control  large  screen 
displays. One example is of a cockpit trainer using a one-hundred-and-eighty-degree
computer-generated-imagery  large  screen  display.  The  pilot-trainee  communicates  with 
the  computer  operator  who  then  controls  the  display.  Voice  recognition  could 
decrease  the  training  time  spent,  as  the  display  would  be  changed  more  rapidly  and 
would  release  the  computer  operator's  time.  Another  example  is  an  air  traffic 
control  trainer  with  a  one-hundred-and-twenty-degree  display  using  slides  and  16mm 
projectors.  This  trainer  has  four  computer  terminals  requiring  five  instructor- 
operators and two or three more instructors in the "tower" to train three students.
The  instructor-operators  act  as  pilots  and  program  their  aircrafts'  movements  across 
the  display.  Voice  recognition  and  speech  generation  could  replace  the  five 
instructor-operators  in  this  case. 

Students can have voice interaction with trainers using video gaming techniques,
such as in the Combat Engineer Vehicle Trainer. In this trainer the student would
vocally command team members and vehicle actions which he views on the video display.
Voice here provides greater fidelity to actual command situations.

Currently, computer models of the listener constitute attempts to duplicate
reasoning processes. This is done by providing the computer with some form of higher
level knowledge sources which may be considered synonymous with artificial intelligence.
A computer system that incorporates the factual knowledge of linguistics and
of the language plus the heuristic knowledge of a human listener would constitute an
artificial intelligence system capable of achieving a high degree of voice recognition
accuracy as well as speech understanding.^

ISD

The Interservice Procedures for Instructional Systems Development, or ISD Model,
has been modified by the various services to suit their respective needs. The Air
Force has expanded the model to incorporate training devices, and the Navy has
extended the model to include speech system design. Since training device design and
development is largely an engineering task, the primary involvement of the instructional
developer is in analysis and pre-engineering design phases. It is in these
phases where the initial device design concept is determined based on an analysis of
the tasks. The device design concept includes the functions and features required
for the device to instruct the learner to perform the tasks. This concept is then
turned over to the engineers who develop the design specification for the device.

The ISD model notes five phases in instructional development - analysis, design,
development, implementation, and control. The analysis phase consists of, among
other things, job analysis, task selection, job performance measure construction,
existing course analysis, and instructional setting selection. When analyzing a job
to determine job requirements, the developer may cycle through these steps several
times. This cyclic analysis is critical in determining the features of a complex
training device. The determination of how to satisfy certain training needs will
depend on the state of the art, costs, and the integration complexities of various
technologies.


A needs analysis identifies a discrepancy between what is and what ought to be.
Should it be determined that the discrepancy requires a training solution, the ISD
model would be utilized. First, the job is defined and broken into validated tasks
with conditions, cues, standards and elements. Tasks are then selected for training
using performance, criticality, and timing data. In an environment where a training
device is not a consideration, the developer determines job performance requirements,
reviews existing training courses, determines the instructional setting, and moves on
to the design phase. Although each of these steps is complex, the addition of training
devices further complicates the process.

If a device is a consideration, this is determined at the task selection stage.
Once the training tasks have been selected, they must be divided into those which
require hands-on training and those which do not. The process requires looking at
the skills and knowledge related to the task elements. The training device tasks
must then be analyzed to determine the instructional features needed on the device to
effectively train the student. Instructional features are those features of the
trainer which are involved in training. The selection of instructional features is
dependent on the nature of the trainer - maintenance, operator, scenario, or
part-task - on the performance environment and on the required fidelity. Features include
cues, feedback, performance measurement, malfunctions, dials, indicators, scopes,
screens, time, controls, speech, motion, visuals, audio, data inputs and outputs, and
the manipulation of the features themselves. The selected features must be looked at
in relationship to one another, and within the constraints of the program. This
means consideration of integration, costs, time, and technology state-of-the-art. A
realistic preliminary device description can then be made to be turned over to the
engineers.

CVT in ISD


More specifically, the determination of a CVT requirement can be shown as a subroutine
of the instructional development model. The Navy sponsored the development
of an engineering design guide for CVT which forms the basis for this subroutine and
details voice system design with two exceptions.21 This guide does not specify the
factors which signal a speech requirement nor the integration of CVT with other
computerized technologies.

When selecting tasks which can be trained using training devices, one of the
selection criteria is a speech requirement. The following situations indicate the
need for a training device with a speech generation and/or voice recognition
capability:

• In tasks where CVT is used in the operational or maintenance environment.
In systems where voice is used to control the computer and/or to issue
information from the computer, CVT is required for training.

• In speech related tasks, where vocal advisories or commands are required, or
where there is voice interaction. Included are such tasks as air traffic
control, air intercept control, and landing signal officer.

• For data entry in computerized training systems when the hands and eyes are
otherwise occupied, as in air traffic control and vehicle or aircraft operation
training.

• Where voice is a modality for measuring performance. Voice recognition can
be used to measure a student's performance in primarily vocal tasks such as
air traffic control, and command and control.

• To stimulate verbal communications. Speech generation can ask questions or
give directions which require verbal responses from the student.

• When voice feedback is required. Speech generation can provide fast and
natural vocal feedback as needed.

• When the instructor has a heavy workload in equipment set-up, CVT can aid
instructors in dealing with complex trainers or other non-instructional
burdens. CVT can provide a computerized instructor's assistant to aid in
teaching and/or provide administrative assistance.

• When the training task requires high instructor interaction with the
student, CVT can provide an instructor model and reduce the manpower
required for teaching certain topics.

• For team training when all members of a team cannot be trained at once.
This eliminates the need for additional manpower for role-playing. This
also allows stricter control of the instructional situation by allowing
control over and variance of the performance of the modeled team members.

• For heads-up, hands-busy situations such as maintenance training. When the
student's hands are busy with the maintenance task, he can control the
instruction vocally. The student can request the next frame or ask for a
repeat of a frame. If his eyes are also busy, speech generation can direct
the task. Voice recognition can also be employed where keyboard operation
by the student is not desired, and light pen is not an acceptable solution.

In addition, both performance measurement and environment issues must be considered
in determining a speech requirement.

Once the speech tasks have been selected, the determination of the nature and
feasibility of CVT as a speech alternative can begin. This step would be concurrent
with the same procedures for other computerized technologies. In fact, the feasibility
determination will also require consideration of all computer technologies as
a unit in order to assess integration problems and associated hardware and software
costs.

Since the Navy CVT design guides are written for engineers and are, consequently, very
detailed, the developer should first perform a cursory analysis to determine the
feasibility of using CVT. If feasible, the developer and/or the engineer can then perform
the more detailed analysis. A computer speech technology specialist should be
involved during this process. These are the steps in CVT system design:

1.  Establish  Vocabulary 

2.  Identify  Voice  Technology  System  Design  Requirements 

3. Determine Voice Technology State-of-the-Art

4.  Project  Voice  Technology  Capability 

5.  Make  Design  Decisions 

6.  Develop  Operating  and  Human  Factors  Design 

7.  Develop  Voice  System  Design  Requirements  Specifications 

These steps are the same for both speech generation and voice recognition, although
the procedures for steps two, three and five differ. Steps one, four, six and
seven should be carried out concurrently when considering both computer voice
recognition (CVR) and computer speech generation (CSG). For our purpose, voice
recognition will be considered first.

A  surface  analysis  of  CVR  requires  consideration  of  the  following  factors: 

• Isolated vs. Connected Recognition

• Vocabulary Size

• Speaker Dependency

• Task Criticality

•  Voice  Collection 

•  Environment 

These factors must then be related to the state-of-the-art to determine if the
technology exists and if so, related costs. If the system seems feasible, then the
more detailed analysis must be performed, resulting in a speech system requirements
analysis.
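
The surface analysis above amounts to comparing a few requirement factors against
what current recognizers can do. A minimal sketch of such a screening step follows.
The capability figures are placeholders, not data from this paper or from the Navy
design guide, and the class and field names are assumptions; voice collection and
cost are left out for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CvrRequirement:
    connected_speech: bool      # isolated vs. connected recognition
    vocabulary_size: int
    speaker_independent: bool   # speaker dependency
    critical_task: bool         # task criticality
    noisy_environment: bool     # environment


@dataclass
class StateOfTheArt:
    # Placeholder capability limits; substitute surveyed values in practice.
    max_isolated_vocabulary: int = 200
    max_connected_vocabulary: int = 20
    speaker_independent_available: bool = False


def screen(req: CvrRequirement, sota: StateOfTheArt) -> List[str]:
    """Return the reasons a requirement fails the cursory screen (empty if none)."""
    issues = []
    limit = (sota.max_connected_vocabulary if req.connected_speech
             else sota.max_isolated_vocabulary)
    if req.vocabulary_size > limit:
        issues.append("vocabulary size exceeds assumed recognizer capacity")
    if req.speaker_independent and not sota.speaker_independent_available:
        issues.append("speaker-independent recognition assumed unavailable")
    if req.critical_task and req.noisy_environment:
        issues.append("critical task in a noisy environment needs detailed analysis")
    return issues
```

An empty result only means that the more detailed analysis is worth performing; it
does not establish that the system will meet its recognition requirement.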

Now  let's  consider  design  of  computer  speech  generation.  The  same  general  proce¬ 
dures  apply. 

A cursory analysis of CSG should look at the required vocabulary size, the number
of speaker voices, the required voice quality and the complexity of transmissions.
These can be generally related to the state-of-the-art to determine if the technology
exists and if so, the related costs. If the system seems feasible, then a more
detailed analysis must be performed resulting in a speech system requirements
specification.

Once the voice system has been determined to be feasible, the process returns to
the training device development model where making design decisions and developing
operating and human factors designs actually occur in conjunction with the same
analysis as for the device. These functions are particularly critical for highly complex
technological devices, as the developer must also consider the integration of
technologies. A system may also require a highly complex and flexible visual system
such as utilizing video discs. For example, both the hardware integration and software
design issues must be considered before finalizing the system specification for
either the video disc or speech systems. In addition, technology advances must be
determined and planning completed prior to matching the release of these advances to
the device production schedule.

Concurrent with the design of the training device is the design of the rest of
the training system. The ISD model design phase determines objectives, tests, entry
behavior, and the sequence and structure of the training, to include use of the
training device. The rest of the ISD process - development, implementation and control -
should occur concurrently and interactively with the actual development of the
training device. Course development must accommodate the training at all times so that the
resulting training is integrated, effective, and efficient.

Human  Factors 

In  designing  instruction  using  CVT,  there  are  many  human  factors  issues  to 
consider. 22  The  primary  issues  are  as  follows: 

•  Validation  -  machine  training,  retraining,  modeling  of  voice  technology 

•  User  frustration,  stress,  fatigue  and  boredom 

•  User  task  training 

•  User  system  acceptance 

•  Environment 

In  speaker  dependent  systems,  the  speaker  must  input  from  one  to  ten  samples  of 
each  utterance  depending  upon  the  hardware  requirements  of  the  system  selected,  so 
that  the  system  can  recognize  the  user's  words.  The  newly-collected  utterances 
should be validated before continuing. With a large vocabulary, training the
computer can be a tedious task for the user. This redundancy can be minimized by having
the user train utterances as needed throughout the actual task training or by
selecting a device requiring few voice samples, although such devices are usually more
expensive.  In  addition,  either  the  user  or  the  system  must  be  able  to  recognize  when 
retraining  may  be  needed,  primarily  in  situations  of  non-  and  misrecognition.  If  the 
system  is  being  used  over  an  extended  period  of  time,  the  user  may  want  to  enter  one 
sample  of  each  word  every  day  or  at  each  use.  The  system  should  be  designed  to  cue 
the  user  to  retrain.  This  type  of  trainee  feedback  leads  to  internalization  by  the 
student  of  the  concepts  of  how  speech  is  recognized  by  computers. 
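
One way the system might cue the user to retrain is to keep a short history of recognition outcomes for each word and raise a flag when failures accumulate. The following is a minimal sketch under stated assumptions: the class name `RetrainMonitor`, the window of five attempts, and the limit of two failures are hypothetical choices for illustration, and the sketch assumes some other part of the system can tell whether an utterance was recognized correctly.

```python
from collections import defaultdict, deque


class RetrainMonitor:
    """Track recent recognition outcomes per word and cue retraining."""

    def __init__(self, window=5, max_failures=2):
        # window and max_failures are illustrative values, not from the text.
        self.max_failures = max_failures
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, word, recognized_ok):
        """Store the outcome of one attempt (non- or misrecognition = False)."""
        self.history[word].append(bool(recognized_ok))

    def needs_retraining(self, word):
        """True when recent failures for this word reach the preset limit."""
        failures = sum(1 for ok in self.history[word] if not ok)
        return failures >= self.max_failures

    def retrained(self, word):
        """Clear the history once the user has supplied fresh samples."""
        self.history[word].clear()
```

Because only the most recent attempts are kept, an occasional failure does not trigger a cue, while a run of failures on the same word does.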

Misrecognition, unnatural speech patterns, machine training, the environment, and
the task itself can all lead to user frustration, stress, fatigue and boredom. All
of these can, in turn, affect the user's voice and cause recognition difficulties.
Therefore, the designer must take care to convey to the student the significance of
these factors. Then, when the student can recognize these factors, internalization
is beginning.

One of the most important human factors issues is user system acceptance. In
CSG, the user must be able to relate to the speech quality of the computer. In a
simple, non-critical task a robotic-sounding voice may be acceptable and perhaps
even interesting. But in a complex, critical task, such as AIC training, robotic,
disjointed speech can be distracting and frustrating when the user needs to hear
natural-sounding responses from the pilot model. Therefore, the user may need to
listen to the computerized speech before actual task training begins in order to
familiarize himself with its sound and to minimize any distraction. In
phoneme-generated speech, familiarization can increase the user's level of comprehension. In
CVR the user must be able to relate comfortably to speaking to a machine. The extent
of the user training required depends on the system used. Isolated word recognition
systems require significant pauses between words or phrases, which may be unnatural
for the speaking requirement. Easily confused words, if they must be used, may
require substituting a different word, such as "execute" for "run", which the user
must remember. The user must also try to keep his voice consistent to achieve the
best recognition. In CVR the user must also be trained to use and train the system
itself. The user may also either over- or underestimate the capabilities of a CVT
system. The user may see it as a "Star Wars" system which is intelligent and can
understand anything that is said, then become frustrated when it doesn't react like a
human. On the other hand, the user may not believe the system's capabilities and may
not use it to its full extent. Both of these situations can be handled with user
training in CVT, especially if the training incorporates a "user friendly" design,
which will be discussed later.

The environment must also be considered in using CVT. A noisy environment, such
as one with machinery or aircraft noise, may require repeated or delayed inputs. Another
noise  problem  can  result  from  the  user  in  terms  of  coughing,  sneezing,  or  throat 
clearing.  The  computer  will  attempt  to  recognize  these  sounds  as  words.  Certain 
types  of  earphones  or  microphones  may  be  required  to  help  minimize  noise  problems. 


CVT Design


Once the technical content of the training is organized, including the use of CVT,
the CVT portions can be designed for optimal use and acceptance. At this stage the
system hardware has been defined, including microphones and earphones. The
instructional developer must then design the instruction for the software developer. The
instructional developer has all of the usual considerations - cues, feedback,
prompting, exercises, review, branching, and so on. For CVT he must also determine how and
when to train and retrain vocabulary, model the instructor, design CVT feedback and
cues, and provide user training.


The use of connected or isolated recognition should have been determined by the
design phase. The type of voice input required must be incorporated into the
design. The user must also be trained to pause between words or phrases or to
structure his inputs in certain formats.

The user must also be trained to train the machine with the necessary
vocabulary. The design should incorporate vocabulary training as needed throughout the
actual task training. The user should also receive some training in CVT as needed
during actual task training. If the user has a special requirement or if the
training task is particularly complex, then the developer may want to consider a job aid.
This technique can be accomplished either via a handout or via software, depending
upon cost considerations. The job aid would include such information as training,
retraining and pausing. User training should include training the system,
recognizing unnatural voice patterns, pausing, and becoming familiar with vocal noises, such
as coughing, sneezing, and throat clearing.

The  design  of  the  task  training  must  take  into  consideration  the  constraints  of 
the  technologies  being  used  to  support  the  training.  Complex  task  training  could  be 
beyond  the  capabilities  of  the  state-of-the-art  of  the  technologies  or  could  require 
highly  complex,  expensive  programming  and  interfaces. 

Both the user training and the task training can be made very acceptable to the
user if the system can accommodate computer feedback either visually or by computer
speech generation. The computer can tell the user where his recognition problems
lie. For example:

1. The computer can list several words which were closest to the misrecognized
word and give a score for each possible choice. The score might be computed
from 1 to 10, with 10 being the most recognizable. For example, if the
misrecognized word is 'four,' the computer can say, "What you said sounded
like 'door' - 8.3, 'pour' - 8.1, 'four' - 7.8, and 'floor' - 7.4." This
helps the user identify areas of mismatch and encourages him to retrain.
The computer, in fact, might recommend retraining within preset parameters.
It also helps the user develop a model of what a computer 'ear' hears when a
human speaks.

2. The computer can determine if misrecognition is due to speech that is too loud or
too soft and tell the user. If the user finds this volume to be normal, he
may  wish  to  retrain  the  computer. 

3.  The  computer  can  be  programmed  to  reject  or  not  recognize  any  word  which 
falls below a certain score or acceptance threshold. The computer can say
"What you said was scored 4.1, which is below the acceptance threshold of
6.5." 
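
The three kinds of feedback listed above can be sketched together in a few lines of Python. This is a minimal illustration, not the design of any fielded system: the function name `recognition_feedback`, the input-level limits in decibels, and the assumption that the recognizer exposes a score for each vocabulary word and a measured input level are all hypothetical. Only the 1-to-10 scale, the example scores, and the 6.5 acceptance threshold come from the text.

```python
ACCEPT_THRESHOLD = 6.5   # acceptance threshold quoted in the text (1-10 scale)
MIN_LEVEL_DB = -35.0     # hypothetical "too soft" limit, not from the text
MAX_LEVEL_DB = -5.0      # hypothetical "too loud" limit, not from the text


def recognition_feedback(scores, level_db, top_n=4):
    """Return a feedback message for one utterance.

    scores   -- dict mapping each vocabulary word to its match score (1-10)
    level_db -- measured input level of the utterance (assumed available)
    """
    # Example 2: report speech that is too soft or too loud.
    if level_db < MIN_LEVEL_DB:
        return "Your speech was too soft. Retrain if this is your normal volume."
    if level_db > MAX_LEVEL_DB:
        return "Your speech was too loud. Retrain if this is your normal volume."

    # Rank candidate words from best to worst match and keep the top few.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]
    if not ranked:
        return "Nothing was recognized. Please repeat."

    # Example 3: reject anything that scores below the acceptance threshold.
    best_score = ranked[0][1]
    if best_score < ACCEPT_THRESHOLD:
        return (f"What you said was scored {best_score:.1f}, which is below "
                f"the acceptance threshold of {ACCEPT_THRESHOLD:.1f}.")

    # Example 1: list the closest words with their scores.
    listing = ", ".join(f"'{word}' - {score:.1f}" for word, score in ranked)
    return f"What you said sounded like {listing}."
```

The returned string could be shown on a display or passed to a speech generator, depending on which form of feedback the system supports.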

The feedback can be very informal and chatty if this is acceptable, although this
could  be  expensive  in  terms  of  additional  software. 

The  vocabulary  size  and  type  of  system  will,  to  a  large  part,  determine  the 
design.  If  a  task  requires  a  small  vocabulary  and  connected  speech  recognition, 
these requirements could be trained prior to the onset of actual task training.
However, training the computer as needed during the task (in-context) is probably the
most  acceptable  to  the  user. 

Points  to  consider  when  designing  computer  voice  technology  training  follow: 

•  Encourage  warm-up,  relaxing,  breaks,  practice,  and  retraining. 

•  Emphasize  consistency. 

•  Prompt  the  user  in  saying  the  words.  If  CSG  is  used,  this  can  be  done 
vocally. 

•  Allow  for  as  much  in-context  training  as  possible. 

•  Introduce  the  vocabulary  as  needed. 

•  Keep  repetition  to  a  minimum. 

• Validate newly collected words or phrases before proceeding. Retrain if a
misrecognition is detected.

•  Update  voice  samples  during  the  training  procedures. 

•  Discourage  extraneous  noise  such  as  coughing,  sneezing,  and  throat  clearing. 

• Trade off non-recognition against misrecognition through software control.
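
The last point, trading non-recognition against misrecognition, usually comes down to where the acceptance threshold is set. The sketch below assumes a set of logged trials in which the word actually spoken is known; the function name `error_rates` and the trial format are hypothetical. Raising the threshold rejects more utterances (non-recognition) but lets fewer wrong words through (misrecognition).

```python
def error_rates(trials, threshold):
    """Return (non_recognition_rate, misrecognition_rate) for a threshold.

    trials -- list of (spoken_word, best_match_word, best_score) tuples,
              an assumed log format used only for this illustration
    """
    if not trials:
        return 0.0, 0.0
    non_recognized = 0
    misrecognized = 0
    for spoken, matched, score in trials:
        if score < threshold:
            non_recognized += 1      # rejected: the user must repeat the word
        elif matched != spoken:
            misrecognized += 1       # accepted, but the wrong word was chosen
    total = len(trials)
    return non_recognized / total, misrecognized / total
```

Evaluating this function over a range of thresholds shows the designer how many repeats must be tolerated to achieve a given reduction in wrong responses, which matters most when task criticality is high.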


Computer  voice  technology  is  a  viable  training  component  which  is  rapidly  growing 
in  its  ability  to  provide  flexibility,  fidelity  and  sophistication  to  the  training 
environment. The instructional designer can maintain control over the use of new
technologies such as CVT in training by achieving the awareness and facility
offered by this paper.